[kepler-dev] [Ptolemy] Re: [Bug 3693] - MultiInstanceComposite actor deadlocks sometimes

Fri Dec 19 15:42:26 PST 2008

Yup, I found a similar case too...
Turns out I need to offer finer-grained access to the
Workspace capabilities.  Sometimes, Workspace.wait(Object)
won't do... Instead, I have to do something like this:

      int depth = 0;
      try {
          synchronized (obj) {
              ...
              depth = releaseReadPermission();
              obj.wait();
          }
      } finally {
          if (depth > 0) {
              reacquireReadPermission(depth);
          }
      }

The reason is that if the obj.wait() occurs outside
the synchronized block, there is a chance it will miss
the notification it is waiting for...
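
A minimal, self-contained sketch of that pattern follows. It is not
Ptolemy's Workspace: ToyWorkspace is a hypothetical stand-in built on
java.util.concurrent's ReentrantReadWriteLock, and only the names
releaseReadPermission() and reacquireReadPermission() are borrowed from
the snippet above. The "ready" flag and both threads are assumptions
added so the example can run on its own.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a workspace with read/write permissions.
class ToyWorkspace {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void getReadAccess()  { lock.readLock().lock(); }
    void doneReading()    { lock.readLock().unlock(); }
    void getWriteAccess() { lock.writeLock().lock(); }
    void doneWriting()    { lock.writeLock().unlock(); }

    // Drop every read hold of the calling thread; return how many were held.
    int releaseReadPermission() {
        int depth = lock.getReadHoldCount();
        for (int i = 0; i < depth; i++) {
            lock.readLock().unlock();
        }
        return depth;
    }

    // Take back the given number of read holds.
    void reacquireReadPermission(int depth) {
        for (int i = 0; i < depth; i++) {
            lock.readLock().lock();
        }
    }
}

class WaitInsideSyncDemo {
    // Returns true if both threads finished and the waiter saw the flag.
    static boolean run() {
        final ToyWorkspace workspace = new ToyWorkspace();
        final Object obj = new Object();
        final boolean[] ready = {false};      // guarded by obj's monitor
        final boolean[] sawReady = {false};

        Thread waiter = new Thread(() -> {
            workspace.getReadAccess();
            int depth = 0;
            try {
                synchronized (obj) {
                    // Release and wait under the same lock, so the notifier
                    // cannot slip in between the two steps.
                    depth = workspace.releaseReadPermission();
                    while (!ready[0]) {
                        obj.wait();
                    }
                    sawReady[0] = true;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                // Reacquire only after the monitor on obj has been released.
                if (depth > 0) {
                    workspace.reacquireReadPermission(depth);
                }
                workspace.doneReading();
            }
        });

        Thread writer = new Thread(() -> {
            workspace.getWriteAccess();   // needs all read holds released
            try {
                synchronized (obj) {
                    ready[0] = true;
                    obj.notifyAll();
                }
            } finally {
                workspace.doneWriting();
            }
        });

        waiter.start();
        writer.start();
        try {
            waiter.join(5000);
            writer.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return sawReady[0] && !waiter.isAlive() && !writer.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("completed without deadlock: " + run());
    }
}
```

The read permission is taken back in the finally clause, after the
synchronized block has exited, so the waiter never blocks on the
workspace while it still holds the lock on obj.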

I've got a fix in my tree on which I haven't been able to
get Christopher's example to fail...

I'm going to run more tests, but I'm reasonably confident
I've got this fixed.

Threads really and truly suck...

Edward

Bert Rodiers wrote:
> Hello Edward,
> 
> I tried the model Christopher made and I end up in another deadlock, 
> where one ProcessThread has synchronized on the director and wants write 
> access on the workspace, while the manager thread holds a read lock on the 
> workspace and wants to synchronize on the director without the 
> read lock being released. So it is a typical case in which two threads want 
> two different resources and acquire them in a different order.
> 
> Below the detailed description:
> 
> The first thread is the Manager thread. At some time it will call wrapup()
> Manager.wrapup() will call
>         // Wrap up the topology
>         _container.wrapup();
> CompositeActor.wrapup() will get read access on the workspace: 
> _workspace.getReadAccess();
> and afterwards it will call director.wrapup();
> ProcessDirector.wrapup will call _requestFinishOnReceivers();
> which leads to nextReceiver.requestFinish();
> PNQueueReceiver.requestFinish() will synchronize on the director:
> synchronized (_director) {
> This is where this thread is hanging.
> The read lock is not released, since the workspace wait method that takes 
> an object has not been used.
> 
> The second thread  is a ProcessThread (the one for the 
> MultiInstanceComposite actor).
> ProcessThread.run() will do this in a finally clause:
> 
>             synchronized (_director) {
>                 _director.removeThread(this);
> 
>                 try {
>                     // NOTE: Deadlock risk here.
>                     // Holding a lock on the _director during wrapup()
>                     // might cause deadlock with hierarchical models where
>                     // wrapup() waits for internal actors to conclude,
>                     // doing a wait() on its own internal director.
>                     // Meanwhile, this thread will hold a lock on this
>                     // outside director.  As long as the inside model
>                     // doesn't try to access synchronized methods of
>                     // outside director, this may be OK.
>                     wrapup();
> 
> ProcessThread.wrapup() will call _actor.wrapup(), which is in fact 
> MultiInstanceComposite.wrapup(). That calls relation.setContainer(null), 
> which in the end leads to a write access request on the workspace 
> (_workspace.getWriteAccess()), which will call wait().
> This is because the first thread still holds the read lock. That thread, 
> however, wants to synchronize on the director, which is not possible 
> since this thread has already done so.
> 
> I believe that this deadlock won't be fixed by your proposal. In this 
> case only one thread has a read lock and only one thread has a write 
> lock. Neither thread has used the wait method you mentioned. Other 
> threads do, but these are all waiting on these two threads.
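
The lock ordering described above can be reproduced in isolation. This
is only a sketch under stated assumptions: a ReentrantReadWriteLock
stands in for the workspace and a ReentrantLock stands in for
synchronized (_director), because an intrinsic monitor cannot be
acquired with a timeout. The class name, the latches and the 200 ms
timeout are all hypothetical. Each thread holds its first resource
until both have tried for the second, so neither attempt can succeed.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockOrderInversionDemo {
    // Returns true if neither thread could obtain its second resource,
    // i.e. with untimed acquisition the two threads would deadlock.
    static boolean bothWouldBlock() {
        final ReentrantReadWriteLock workspace = new ReentrantReadWriteLock();
        final ReentrantLock director = new ReentrantLock(); // stand-in for the _director monitor
        final CountDownLatch bothHoldFirst = new CountDownLatch(2);
        final CountDownLatch bothTried = new CountDownLatch(2);
        final boolean[] gotSecond = new boolean[2];

        // "Manager" role: holds the read lock, then wants the director.
        Thread manager = new Thread(() -> {
            workspace.readLock().lock();
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();
                gotSecond[0] = director.tryLock(200, TimeUnit.MILLISECONDS);
                if (gotSecond[0]) {
                    director.unlock();
                }
                bothTried.countDown();
                bothTried.await();   // keep the read lock until both attempts are over
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                workspace.readLock().unlock();
            }
        });

        // "ProcessThread" role: holds the director, then wants write access.
        Thread process = new Thread(() -> {
            director.lock();
            try {
                bothHoldFirst.countDown();
                bothHoldFirst.await();
                gotSecond[1] = workspace.writeLock().tryLock(200, TimeUnit.MILLISECONDS);
                if (gotSecond[1]) {
                    workspace.writeLock().unlock();
                }
                bothTried.countDown();
                bothTried.await();   // keep the director until both attempts are over
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                director.unlock();
            }
        });

        manager.start();
        process.start();
        try {
            manager.join();
            process.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !gotSecond[0] && !gotSecond[1];
    }

    public static void main(String[] args) {
        System.out.println("both threads would block: " + bothWouldBlock());
    }
}
```

Acquiring the two resources in the same order in both threads, or
releasing the read lock before synchronizing on the director, would
remove the cycle in this sketch.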
> 
> Regards,
> Bert
> 
> 2008/12/19 Edward A. Lee <eal at eecs.berkeley.edu 
> <mailto:eal at eecs.berkeley.edu>>
> 
> 
>     I have a diagnosis of this problem, and I believe I have
>     a fix, but as usual with threads, I'm not fully confident
>     in the solution.  I guess if this sounds reasonable, then
>     it would increase my confidence.
> 
>     The MultiInstanceComposite apparently triggers a bug
>     because its wrapup() method acquires write access to the
>     workspace. In your model, multiple threads will be simultaneously
>     trying to acquire write access, something that is fairly rare
>     in uses of Ptolemy.  This is why we see the bug only with
>     uses of MultiInstanceComposite.
> 
>     The problem is in the use of the Workspace wait(Object obj) method.
>     What this method does is release any read permissions that
>     the calling thread has on the workspace, call obj.wait(),
>     reacquire the read permissions, and return.
> 
>     The problem is that almost everywhere in the tree where
>     this is called, it is inside a synchronized block,
>     something like this:
> 
>      synchronized(obj) {
>         ...
>         _workspace.wait(obj);
>         ...
>      }
> 
>     The problem occurs when the wait(Object obj) method tries
>     to reacquire read permissions.  At that point, it holds
>     a lock on obj, and blocks until the workspace grants
>     read permission.
> 
>     If there is a thread waiting for write permission, the read
>     permission is not granted.  The problem occurs when another
>     thread tries to get a lock on obj while holding read or write
>     permission on the workspace. Deadlock.
> 
> 
> 
>     I think that the fix is that a thread that calls
>     wait(Object obj) should not hold a lock on obj when it makes
>     that call... This is counterintuitive to Java programmers,
>     because generally you _have to_ hold the lock to call wait().
>     Indeed, inside wait(Object obj), it acquires the lock, but
>     the key is that it releases that lock before it tries to
>     reacquire read permissions, thus preventing the deadlock
>     if the calling thread does not already hold the lock.
> 
>     I believe this is correct because wait(Object obj) will
>     release any lock on obj anyway for an indeterminate amount
>     of time while obj.wait() is called.  Thus, no calling method
>     can really assume the lock is held across the call
>     to wait(Object obj).
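
A minimal sketch of the behaviour proposed here, assuming a
hypothetical SketchWorkspace built on ReentrantReadWriteLock rather
than the real Ptolemy class. The method name waitOn() and the demo
harness are assumptions. The point it illustrates is that the read
holds are taken back only after the monitor on obj has been released,
which holds only if the caller did not already own that monitor.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in, not the Ptolemy Workspace.
class SketchWorkspace {
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Release the caller's read holds, wait on obj, then reacquire the
    // read holds after leaving the synchronized block.
    void waitOn(Object obj) throws InterruptedException {
        int depth = lock.getReadHoldCount();
        for (int i = 0; i < depth; i++) {
            lock.readLock().unlock();
        }
        try {
            synchronized (obj) {
                obj.wait();
            }
        } finally {
            // The monitor on obj is no longer held here, so blocking on
            // the workspace cannot stall a thread that wants obj.
            for (int i = 0; i < depth; i++) {
                lock.readLock().lock();
            }
        }
    }
}

class WaitOnDemo {
    // Returns true if the waiter came back holding exactly one read hold.
    static boolean run() {
        final SketchWorkspace workspace = new SketchWorkspace();
        final Object obj = new Object();
        final int[] holdsAfter = {-1};

        Thread waiter = new Thread(() -> {
            workspace.lock.readLock().lock();
            try {
                workspace.waitOn(obj);   // caller does not hold obj's monitor
                holdsAfter[0] = workspace.lock.getReadHoldCount();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                workspace.lock.readLock().unlock();
            }
        });
        waiter.start();

        try {
            // Only notify once the waiter is parked in obj.wait().
            while (waiter.isAlive() && waiter.getState() != Thread.State.WAITING) {
                Thread.sleep(10);
            }
            workspace.lock.writeLock().lock();   // possible: read holds were released
            try {
                synchronized (obj) {
                    obj.notifyAll();
                }
            } finally {
                workspace.lock.writeLock().unlock();
            }
            waiter.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !waiter.isAlive() && holdsAfter[0] == 1;
    }

    public static void main(String[] args) {
        System.out.println("waiter resumed with read hold restored: " + run());
    }
}
```

Note that the demo has to poll until the waiter is actually waiting.
With this shape the caller cannot test its condition under the same
lock as obj.wait(), so a notification sent too early would be lost.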
> 
> 
>     Edward
> 
>     Christopher Brooks wrote:
> 
>         Hi Edward,
>         Here's a MultiInstanceComposite model that hangs for me.
>         I've attached a Ptolemy version.
> 
>         The model has PN on the outside with SDF inside the
>         MultiInstanceComposite.  The MultiInstanceComposite has
>         no actors, just a link between the ports, which is rather odd.
> 
>         _Christopher
> 
>         bugzilla-daemon at ecoinformatics.org
>         <mailto:bugzilla-daemon at ecoinformatics.org> wrote:
> 
>             http://bugzilla.ecoinformatics.org/show_bug.cgi?id=3693
> 
> 
> 
> 
> 
>             ------- Comment #4 from crawl at sdsc.edu
>             <mailto:crawl at sdsc.edu>  2008-12-05 11:12 -------
>             I was able to reproduce the deadlock in Jianwu's workflow on:
> 
>             Windows XP, java 1.6.0_11, Kepler 1.0.0
>             Mac, java 1.5.0_16, both Kepler 1.0.0 and head