Bug 98172 - Concurrent calls to glClientWaitSync result in a segfault in one of the waiters.
Summary: Concurrent calls to glClientWaitSync result in a segfault in one of the waiters.
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Mesa core
Version: 11.2
Hardware: Other All
Importance: medium normal
Assignee: mesa-dev
QA Contact: mesa-dev
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-10-09 17:04 UTC by Suzuki, Shinji
Modified: 2016-10-26 16:51 UTC
CC List: 1 user

See Also:
i915 platform:
i915 features:


Attachments
Work with a local reference of so->fence (2.07 KB, patch)
2016-10-11 07:36 UTC, Michel Dänzer
Details | Splinter Review
Diff that went into my copy of mesa-source (1.74 KB, patch)
2016-10-12 11:45 UTC, Suzuki, Shinji
Details | Splinter Review
Lock the shared state mutex and work with a local reference of so->fence (3.49 KB, patch)
2016-10-13 09:10 UTC, Michel Dänzer
Details | Splinter Review
Arbitration on so->fence based on Michel-san's patch (2.33 KB, patch)
2016-10-15 03:30 UTC, Suzuki, Shinji
Details | Splinter Review
Arbitration on so->fence through per sync-object mutex. (2.31 KB, patch)
2016-10-15 03:42 UTC, Suzuki, Shinji
Details | Splinter Review
Per sync-object mutex. Redundant 'else' and setting of StatusFlag are eliminated. (2.41 KB, patch)
2016-10-17 16:52 UTC, Suzuki, Shinji
Details | Splinter Review
Only protect concurrent access to so->fence (2.36 KB, patch)
2016-10-20 03:10 UTC, Michel Dänzer
Details | Splinter Review

Description Suzuki, Shinji 2016-10-09 17:04:19 UTC
In my app, a fence is created in Thread-A and it gets passed to Thread-B and Thread-C to be waited upon. (Each thread has its own context.)
Thread-A issues the call, fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0).
Thread-B and C issue the call, glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, GL_TIMEOUT_IGNORED);

Most of the time, the waits in both clients succeed, but occasionally one of them generates a segfault. Upon inspection of the generated core, it turned out that so->fence in the expression "&so->fence" at line 113 in src/mesa/state_tracker/st_cb_syncobj.c is NULL, which should be causing the segfault down the call chain through screen->fence_reference. I think there is a race in executing the code block. If I introduce a mutex in my app to avoid concurrent calls to glClientWaitSync, I don't observe the segfault.
Here is the snippet of code in question from st_cb_syncobj.c:

   if (so->fence &&
       screen->fence_finish(screen, so->fence, timeout)) {
      screen->fence_reference(screen, &so->fence, NULL);
      so->b.StatusFlag = GL_TRUE;
   }
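
For reference, a minimal sketch of the kind of application-side workaround mentioned above (the wrapper name and the single process-wide pthread mutex are illustrative, not taken from the reporter's application):

   #define GL_GLEXT_PROTOTYPES
   #include <GL/gl.h>
   #include <GL/glext.h>
   #include <pthread.h>

   /* Illustrative workaround only: serialize the waits so that
    * glClientWaitSync is never entered concurrently from Thread-B and
    * Thread-C.  With this in place the crash is not observed.
    */
   static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;

   static GLenum wait_on_fence(GLsync fence)
   {
      GLenum status;

      pthread_mutex_lock(&wait_lock);
      status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT,
                                GL_TIMEOUT_IGNORED);
      pthread_mutex_unlock(&wait_lock);

      return status;
   }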

My environment is:
Ubuntu 16.04 LTS
Linux a7da 4.4.0-38-generic #57-Ubuntu SMP Tue Sep 6 15:42:33 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
libgl1-mesa-dri:amd64 / 11.2.0-1ubuntu2.2
Radeon HD3300
Comment 1 Suzuki, Shinji 2016-10-09 17:08:57 UTC
Oops, the line number is 114 rather than 113.
Comment 2 Michel Dänzer 2016-10-11 07:36:09 UTC
Created attachment 127204 [details] [review]
Work with a local reference of so->fence

Does this patch help?
Comment 3 Suzuki, Shinji 2016-10-11 16:39:58 UTC
(In reply to Michel Dänzer from comment #2)
> Created attachment 127204 [details] [review] [review]
> Work with a local reference of so->fence
> 
> Does this patch help?

Yes, it does! Thank you for your interest and effort in solving the issue.
I've rebuilt Mesa from the Ubuntu source and run my program with it. Without your patch, the app segfaults as it did with the OS-supplied library. With your patch the app has not crashed even once so far. The app crashes, when it does, mainly upon startup, about 4 times out of 5. I have run the app a few dozen times successfully with the patched library.

My itch is that I can't see why your patch eliminates the race. My understanding is that the segfault happens because the second thread tries to dereference &so->fence, which had been nullified by the first thread and I can't figure out how that is avoided by the patch.

FYI, here is the call stack when the app crashes without the patch.

(gdb) where
#0  pb_reference (src=0x0, dst=0x10)
    at ../../../../../../src/gallium/auxiliary/pipebuffer/pb_buffer.h:239
#1  radeon_fence_reference (dst=0x10, src=0x0)
    at ../../../../../../src/gallium/winsys/radeon/drm/radeon_drm_cs.c:670
#2  0x00007f9b52bff441 in r600_fence_reference (screen=<optimized out>, 
    dst=0x7f9b3003c218, src=0x0)
    at ../../../../../src/gallium/drivers/radeon/r600_pipe_common.c:768
#3  0x00007f9b5272ea7f in st_client_wait_sync (ctx=<optimized out>, 
    obj=0x7f9b3003c1f0, flags=<optimized out>, timeout=<optimized out>)
    at ../../../src/mesa/state_tracker/st_cb_syncobj.c:114
#4  0x00007f9b526b0217 in _mesa_ClientWaitSync (sync=<optimized out>, flags=1, 
    timeout=18446744073709551615) at ../../../src/mesa/main/syncobj.c:341
#5  0x0000000000466106 in Fence::clientWaitSync (this=0x7f9b3001c0c0)
    at /home/suzuki/src/xxxxxxx/ndro2016/gl_fence.cpp:25
#6  0x0000000000479e09 in Fenced<pixel_source*>::clientWaitSync (
    this=0x7f9b4cc26e70) at /home/suzuki/src/xxxxxxx/ndro2016/gl_fence.h:59
#7  0x00007f9b4de64f02 in FileReaderThread::work (this=0x1a7b540)
    at /home/suzuki/src/xxxxxxx/ndro2016/thrd_filereader.cpp:63
#8  0x0000000000465586 in WorkerThread::run (this=0x1a7b540)
    at /home/suzuki/src/xxxxxxx/ndro2016/thrd.cpp:173
#9  0x000000000046529e in WorkerThread::entry_func (worker=0x1a7b540)
    at /home/suzuki/src/xxxxxxx/ndro2016/thrd.cpp:144
#10 0x00007f9b567e36fa in start_thread (arg=0x7f9b4cc27700)
    at pthread_create.c:333
#11 0x00007f9b55a74b5d in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Comment 4 Michel Dänzer 2016-10-12 09:40:34 UTC
Shinji-san,

(In reply to shinji.suzuki from comment #3)
> Yes, it does! Thank you for your interest and effort in solving the issue.

Great, thank you for testing the patch.


> My itch is that I can't see why your patch eliminates the race.

Does the commit log of https://patchwork.freedesktop.org/patch/115219/ clarify it?
Comment 5 Suzuki, Shinji 2016-10-12 11:41:30 UTC
Hello Michel-san.
I'm afraid my testing was not enough and I spoke too soon. Today I ran my app remotely through an ssh session, pointing DISPLAY to :0, as I'm away from my home machine. Now the app crashes, though much less often, even with the modified library containing your patch. Remoting and the screen being blanked seem to affect timing, as I get only around 25 fps vs. the roughly 35 fps I get when working on the console.
Here is the call stack. (The assert() is my attempt at understanding which
code path is taken; it is apparently compiled to a no-op.)

state_tracker/st_cb_syncobj.c
    118        if (screen->fence_finish(screen, fence, timeout)) {
    119           assert(0);
  > 120           screen->fence_reference(screen, &so->fence, NULL);
    121           so->b.StatusFlag = GL_TRUE;
    122        }

r600_pipe_common.c
    766             if (pipe_reference(&(*rdst)->reference, &rsrc->reference))
  > 767                     ws->fence_reference(&(*rdst)->gfx, NULL);
    768                     ws->fence_reference(&(*rdst)->sdma, NULL);
    769                     FREE(*rdst);

radeon_drm_cs.c
    667     static void radeon_fence_reference(struct pipe_fence_handle **dst,
    668                                        struct pipe_fence_handle *src)
    669     {
  > 670         pb_reference((struct pb_buffer**)dst, (struct pb_buffer*)src);
    671     }

src/gallium/auxiliary/pipebuffer/pb_buffer.h

    235     static inline void
    236     pb_reference(struct pb_buffer **dst,
    237                  struct pb_buffer *src)
    238     {
  > 239        struct pb_buffer *old = *dst;
    240
    241        if (pipe_reference(&(*dst)->reference, &src->reference))
    242           pb_destroy( old );
    243        *dst = src;
    244     }

(gdb) p old
$1 = <optimized out>
(gdb) p dst
$2 = (struct pb_buffer **) 0x8
(gdb) 

I'll attach the diff that corresponds to your patch, to make sure I'm not making a mistake in applying it by hand, as some hunks were rejected by 'patch'.
Comment 6 Suzuki, Shinji 2016-10-12 11:45:11 UTC
Created attachment 127237 [details] [review]
Diff that went into my copy of mesa-source
Comment 7 Michel Dänzer 2016-10-13 09:10:27 UTC
Created attachment 127267 [details] [review]
Lock the shared state mutex and work with a local reference of so->fence

Does this patch fix the problem completely?
Comment 8 Suzuki, Shinji 2016-10-13 15:08:27 UTC
Comment on attachment 127267 [details] [review]
Lock the shared state mutex and work with a local reference of so->fence

Review of attachment 127267 [details] [review]:
-----------------------------------------------------------------

I'm afraid execution of st_fence_sync() can still race.
Thread-A can run up to
  success = screen->fence_finish(screen, pipe, fence, 0);
and then get preempted, and Thread-B can run up to the same location.
And then
  screen->fence_reference(screen, &so->fence, NULL);
can still be executed in arbitrary order. If screen->fence_reference()
is thread-safe and returns true only for the first invocation then all is fine, but that is likely not the case, as otherwise we would not be struggling with this issue.
I think the gist of it is that the check of so->fence and the nullifying of it should be executed atomically. If "if (success)" is replaced with "if (success && so->fence)" then the program may behave correctly, but I'm not comfortable about screen->fence_reference() being called concurrently.
I'm also concerned that mutual exclusion on ctx->Shared->Mutex may introduce unnecessarily strict serialization.
Can't we introduce a per-sync-object mutex so that the check of so->fence and the nullifying of it happen atomically?
Is that modification too intrusive? (At least it is unnecessary overhead when st_fence_sync() is not executed concurrently on the same sync object.)
Comment 9 Suzuki, Shinji 2016-10-13 15:27:24 UTC
Oops, I'm getting confused. Concurrent programming is hard.
> screen->fence_reference(screen, &so->fence, NULL);
will not be executed in arbitrary order but serially, due to the mutex locking.
Still, the outcome should be a segfault in the thread that comes late.
Comment 10 Suzuki, Shinji 2016-10-13 15:37:49 UTC
> I'm not comfortable about screen->fence_reference() being called concurrently.

I'm not comfortable about screen->fence_finish() being called concurrently.
Comment 11 Suzuki, Shinji 2016-10-13 15:46:44 UTC
Sorry, I've made too many mistakes in writing. (I should have gone to bed before writing.) I'll rewrite the whole post below.

I'm afraid execution of st_fence_sync() can still race.
Thread-A can run up to
  success = screen->fence_finish(screen, pipe, fence, 0);
and then get preempted, and Thread-B can run up to the same location.
And then
  screen->fence_reference(screen, &so->fence, NULL);
will be executed serially by both threads. If screen->fence_finish()
is thread-safe and returns true only for the first invocation then all is fine, but that is likely not the case, as otherwise we would not be struggling with this issue.
I think the gist of it is that the check of so->fence and the nullifying of it should be executed atomically. If "if (success)" is replaced with "if (success && so->fence)" then the program may behave correctly, but I'm still not comfortable about screen->fence_finish() being called concurrently.
I'm also concerned that mutual exclusion on ctx->Shared->Mutex may introduce unnecessarily strict serialization.
Can't we introduce a per-sync-object mutex so that the check of so->fence and the nullifying of it happen atomically?
Is that modification too intrusive? (At least it is unnecessary overhead when st_fence_sync() is not executed concurrently on the same sync object.)
Comment 12 Michel Dänzer 2016-10-14 01:05:31 UTC
It should be fine. You're right that fence_finish can run concurrently in multiple threads, but even if it returns true in multiple threads, the fence_reference calls are serialized by the mutex. So the second fence_reference call will find so->fence == NULL and do nothing.

Have you run into any problems with this patch in your testing?
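
For readers without access to attachment 127267, the structure under discussion, pieced together from the quotes in comments 8 and 11 and the explanation above, is roughly the following (a reconstruction, not the literal patch, using the 11.2-era fence_finish signature shown in the description):

   struct pipe_fence_handle *fence = NULL;
   boolean success;

   /* Take a per-thread local reference under the shared-state mutex. */
   mtx_lock(&ctx->Shared->Mutex);
   screen->fence_reference(screen, &fence, so->fence);
   mtx_unlock(&ctx->Shared->Mutex);

   if (fence) {
      /* fence_finish may run concurrently in several threads here. */
      success = screen->fence_finish(screen, fence, timeout);
      if (success) {
         mtx_lock(&ctx->Shared->Mutex);
         /* Serialized: a second thread finds so->fence == NULL and the
          * call below becomes a no-op, so the fence is destroyed only
          * once.
          */
         screen->fence_reference(screen, &so->fence, NULL);
         so->b.StatusFlag = GL_TRUE;
         mtx_unlock(&ctx->Shared->Mutex);
      }
      /* Drop the local reference. */
      screen->fence_reference(screen, &fence, NULL);
   }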
Comment 13 Suzuki, Shinji 2016-10-14 02:59:52 UTC
> So the second fence_reference call will find so->fence == NULL and do nothing.

Thank you. I failed to see that fence_reference(screen, &nullptr, 0) is a no-op thanks to the "if (ptr != reference)" check in pipe_reference_described().
I'll get back to you when I finish running some tests.
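
For context, the early-out in that helper looks roughly like this (a paraphrase of the 11.2-era src/gallium/auxiliary/util/u_inlines.h, not a verbatim quote, with the debug bookkeeping omitted); when so->fence has already been cleared under the mutex, the two reference pointers passed down by fence_reference(screen, &so->fence, NULL) end up equal (both NULL in practice), so the check fails and nothing is dereferenced or destroyed:

   /* Paraphrased shape of the helper, for illustration only. */
   static inline boolean
   pipe_reference_described(struct pipe_reference *ptr,
                            struct pipe_reference *reference,
                            debug_reference_descriptor get_desc)
   {
      boolean destroy = FALSE;

      if (ptr != reference) {                   /* the early-out discussed above */
         if (reference)
            p_atomic_inc(&reference->count);    /* take the new reference */

         if (ptr && p_atomic_dec_zero(&ptr->count))
            destroy = TRUE;                     /* caller destroys the old object */
      }

      return destroy;
   }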
Comment 14 Suzuki, Shinji 2016-10-15 03:26:43 UTC
I've been running tests with changes derived from your patch, with and without random sleeps to shake up the execution order, and have seen no crashes so far. The differences from yours are:
* No mutual exclusion in the ctor/dtor functions. (I see no chance of a race in the ctor, and the dtor should be called only by the thread that eliminated the last reference.)
* The locking duration is made slightly shorter.

Now I feel sufficiently convinced of the correctness of the fix, but at the same time I feel that an alternative implementation deserves consideration. As I wrote previously, locking ctx->Shared->Mutex introduces contention that is not essential. Allowing concurrent calls to fence_finish() on the same sync object also requires a more careful implementation from the underlying GPU-specific drivers (though the r600 implementation seems fine), and running the same check in multiple threads seems inefficient. I'll attach a patch for an alternative implementation as well.
Comment 15 Suzuki, Shinji 2016-10-15 03:30:25 UTC
Created attachment 127316 [details] [review]
Arbitration on so->fence based on Michel-san's patch
Comment 16 Suzuki, Shinji 2016-10-15 03:42:11 UTC
Created attachment 127317 [details] [review]
Arbitration on so->fence through per sync-object mutex.
Comment 17 Suzuki, Shinji 2016-10-15 08:50:05 UTC
I need to take my words back.

> the dtor should be called only by the thread that eliminated the last reference.

It looks like glDeleteSync directly calls (or is a macro for?) st_delete_sync_object. If that is true, the reference count on the sync object, if there is one, is not respected. Therefore I think mtx_lock(&ctx->Shared->Mutex) needs to be present in the dtor function as well. I apologize for my oversight.
Comment 18 Suzuki, Shinji 2016-10-15 16:34:11 UTC
> It looks like glDeleteSync directly calls (or is a macro for?)
I thought that was the case because I got this stack trace:
(gdb) where
#0  st_delete_sync_object (ctx=0x863890, obj=0x7fffe8001320)
    at state_tracker/st_cb_syncobj.c:62
#1  0x0000000000466137 in Fence::~Fence (this=0x7fffe80012a0, 
    __in_chrg=<optimized out>)
   at /home/suzuki/src/photron/ndro2016/gl_fence.cpp:31

gl_fence.cpp:31 is the call site of glDeleteSync().
But if I set break point at _mesa_DeleteSync, then I get this.

#0  _mesa_DeleteSync (sync=0x7fffe80017e0) at main/syncobj.c:241
#1  0x0000000000466137 in Fence::~Fence (this=0x7fffe8001760,
    __in_chrg=<optimized out>)
    at /home/suzuki/src/photron/ndro2016/gl_fence.cpp:31

If I step into st_delete_sync_object(), I get the same trace as the first one. It looks as if gdb is not showing the correct call trace while the program is executing st_delete_sync_object().

The bottom line is: glDeleteSync() likely calls st_delete_sync_object not directly but via _mesa_DeleteSync, which does perform reference counting on the sync object.
Comment 19 Michel Dänzer 2016-10-17 09:24:14 UTC
(In reply to shinji.suzuki from comment #16)
> Created attachment 127317 [details] [review] [review]
> Arbitration on so->fence through per sync-object mutex.

One minor comment below, otherwise looks good to me. Feel free to submit the patch generated by git format-patch to the mesa-dev mailing list for review.


+   if (type == GL_SYNC_FENCE) {
+      struct st_sync_object * so = 
+         (struct st_sync_object*)CALLOC_STRUCT(st_sync_object);
+      mtx_init(&so->mtx, mtx_plain);
+      return (struct gl_sync_object *)so;
+   } else
       return NULL;

This can be written as:
 
   if (type == GL_SYNC_FENCE) {
      [...]
   }
 
   return NULL;
Comment 20 Suzuki, Shinji 2016-10-17 16:49:37 UTC
Thank you. That's certainly easier to read.
Also, now I see that so->b.StatusFlag needs to be set to GL_TRUE only when so->fence is set to NULL in either st_check_sync() or st_client_wait_sync(), because so->fence gets set to NULL nowhere else between fence object creation and destruction.
I'll continue to work with the patched lib for a while longer before submitting a patch to the mailing list.
Comment 21 Suzuki, Shinji 2016-10-17 16:52:27 UTC
Created attachment 127366 [details] [review]
Per sync-object mutex. Redundant 'else'  and setting of StatusFlag are eliminated.
Comment 22 Marek Olšák 2016-10-17 20:14:39 UTC
It would be better to call fence_finish while not holding the mutex. For example:

mtx_lock(&so->mtx);
/* If the fence doesn't exist, assume it's signalled. */
if (so->fence) {
   struct pipe_fence_handle *fence = NULL;
   screen->fence_reference(screen, &fence, so->fence);
   mtx_unlock(&so->mtx);

   if (screen->fence_finish(screen, fence, timeout)) {
      mtx_lock(&so->mtx);
      screen->fence_reference(screen, &so->fence, NULL);
      so->b.StatusFlag = GL_TRUE;
      mtx_unlock(&so->mtx);
   }
   screen->fence_reference(screen, &fence, NULL);
} else {
   mtx_unlock(&so->mtx);
}
Comment 23 Suzuki, Shinji 2016-10-18 02:15:29 UTC
(In reply to Marek Olšák from comment #22)
> It would be better to call fence_finish while not holding the mutex. For
>    mtx_unlock(&so->mtx);
> 
>    if (screen->fence_finish(screen, fence, timeout)) {
>       mtx_lock(&so->mtx);

What are the benefits of allowing multiple outstanding calls to fence_finish() per sync-object?
Comment 24 Marek Olšák 2016-10-18 09:11:32 UTC
Hm. Probably none. The patch looks good.

BTW, the patch won't apply on master, because the fence_finish function is slightly different there.
Comment 25 Suzuki, Shinji 2016-10-18 10:27:40 UTC
Thank you for your comments.
The patches I attached here are against 11.2.
I'll base the diff on master when posting it to the mailing list.
Comment 26 Michel Dänzer 2016-10-19 01:19:52 UTC
(In reply to Marek Olšák from comment #24)
> Hm. Probably none.

Actually, I think there are: E.g. consider one thread calling glClientWaitSync with a non-0 timeout, blocking for some time with the mutex locked. If another thread calls glClientWaitSync with a 0 timeout (or whichever API call ends up in st_check_sync) during that time, it'll block until the first thread unlocks the mutex.
Comment 27 Michel Dänzer 2016-10-19 01:21:51 UTC
Note that if we allow concurrent fence_finish calls, I don't think we need a per-sync-object mutex.
Comment 28 Suzuki, Shinji 2016-10-19 03:53:22 UTC
Yes, I agree that we can do without a per-sync-object mutex if we
allow all waiters to enter fence_finish() freely.
That said, a per-sync-object mutex has another benefit: it potentially
reduces lock contention among waiters on different sync objects and
with other Mesa components that deal with shared resources. To be
fair, I also have to mention that ctx->Shared->Mutex is touched
everywhere, so optimizing in this particular context alone may not
make much sense, and adding a mutex certainly has associated overhead.
Overall, +1 on your strategy if concurrent execution of fence_finish()
is to be allowed.


Comment 29 Suzuki, Shinji 2016-10-19 08:32:18 UTC
Sorry, I looked at only your latest comment when I replied previously, because I read it in email.

You are absolutely right. fence_finish() MUST be allowed to run concurrently even when the calls target the same sync object because, as you say, the timeout settings of the waits may differ. I was thinking only in terms of infinite waits. I'll reapply your patch and continue watching how my app behaves.
Thank you for noticing this before I posted my buggy patch to the mailing list.

BTW, st_check_sync and st_client_wait_sync() seem identical except for the existence of flags and timeout. Should they be merged?

(In reply to Michel Dänzer from comment #26)
> (In reply to Marek Olšák from comment #24)
> > Hm. Probably none.
> 
> Actually, I think there are: E.g. consider one thread calling
> glClientWaitSync with a non-0 timeout, blocking for some time with the mutex
> locked. If another thread calls glClientWaitSync with a 0 timeout (or
> whichever API call ends up in st_check_sync) during that time, it'll block
> until the first thread unlocks the mutex.
Comment 30 Michel Dänzer 2016-10-19 08:58:32 UTC
(In reply to Suzuki, Shinji from comment #29)
> BTW, st_check_sync and st_client_wait_sync() seem identical except for the
> existence of flags and timeout. Should they be merged?

Yes, that'll be patch 2, à la https://patchwork.freedesktop.org/patch/115220/ . :)
Comment 31 Suzuki, Shinji 2016-10-19 10:51:47 UTC
(In reply to Michel Dänzer from comment #30)
> (In reply to Suzuki, Shinji from comment #29)
> Yes, that'll be patch 2, à la
> https://patchwork.freedesktop.org/patch/115220/ . :)
Why is this patch rejected?

Now I'm revisiting your patch. Do we need to have mutual exclusion on
   screen->fence_reference(screen, &fence, so->fence);
and
      screen->fence_reference(screen, &so->fence, NULL);
?
Concurrent calls to screen->fence_reference(screen, &so->fence, NULL) must be arbitrated, as they race on dereferencing/resetting the lhs value. But I believe, though I'm not too confident, that the reference counting itself is implemented in a thread-safe manner, so no arbitration on screen->fence_reference(screen, &fence, so->fence) is needed.
Comment 32 Marek Olšák 2016-10-19 12:37:35 UTC
Reference counting is thread safe, but the assignment in the sync obj structure itself isn't. So you need a mutex.
Comment 33 Suzuki, Shinji 2016-10-19 13:09:32 UTC
(In reply to Marek Olšák from comment #32)
> Reference counting is thread safe, but the assignment in the sync obj
> structure itself isn't. So you need a mutex.

I see no assignment to the sync object structure occurring in evaluating 'screen->fence_reference(screen, &fence, so->fence);'

# I noticed that the discussion here is being fed to the mesa-dev mailing list; should we move to an offline channel? Please feel free to e-mail me.
Comment 34 Michel Dänzer 2016-10-19 15:14:19 UTC
(In reply to Suzuki, Shinji from comment #31)
> Why is this patch rejected?

Because it'll have to be rebased on top of the updated patch 1.


> Now I'm revisiting your patch. Do we need to have mutual exclusion on
>    screen->fence_reference(screen, &fence, so->fence);
> and
>       screen->fence_reference(screen, &so->fence, NULL);
> ?

I'm not sure r600_fence_reference() is thread safe.
Comment 35 Marek Olšák 2016-10-19 17:00:12 UTC
(In reply to Michel Dänzer from comment #34)
> (In reply to Suzuki, Shinji from comment #31)
> > Why is this patch rejected?
> 
> Because it'll have to be rebased on top of the updated patch 1.
> 
> 
> > Now I'm revisiting your patch. Do we need to have mutual exclusion on
> >    screen->fence_reference(screen, &fence, so->fence);
> > and
> >       screen->fence_reference(screen, &so->fence, NULL);
> > ?
> 
> I'm not sure r600_fence_reference() is thread safe.

OK, again:

fence_reference is thread-safe with regard to the reference counter and fence destruction. It doesn't, however, protect the "dst" pointer itself. Therefore:

Not thread safe (race condition on so->fence):
  screen->fence_reference(screen, &so->fence, NULL);

Always thread safe (if fence is a local variable):
  screen->fence_reference(screen, &fence, NULL);
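
A toy illustration (not Mesa code; hypothetical names) of the distinction drawn here: the atomic reference count protects the object, but nothing protects a slot that two threads share, whereas a per-thread local slot is inherently safe.

   #include <stdatomic.h>
   #include <stdlib.h>

   struct obj {
      atomic_int refcount;
   };

   /* Analogue of fence_reference(screen, dst, src). */
   static void obj_reference(struct obj **dst, struct obj *src)
   {
      struct obj *old = *dst;           /* unsynchronized read of the slot */

      if (src)
         atomic_fetch_add(&src->refcount, 1);
      if (old && atomic_fetch_sub(&old->refcount, 1) == 1)
         free(old);                     /* destroy on last reference */
      *dst = src;                       /* unsynchronized write of the slot */
   }

   /* Safe: 'local' lives on this thread's stack; no other thread touches it. */
   static void drop_after_use(struct obj *shared)
   {
      struct obj *local = NULL;
      obj_reference(&local, shared);    /* take a reference */
      /* ... use 'local' ... */
      obj_reference(&local, NULL);      /* drop it */
   }

   /* Racy when called from two threads on the same slot: both may read the
    * same non-NULL value, both decrement, and the loser frees memory the
    * winner still uses (the analogue of the crash in this report).
    */
   static void clear_shared_slot(struct obj **slot)
   {
      obj_reference(slot, NULL);
   }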
Comment 36 Suzuki, Shinji 2016-10-19 18:47:07 UTC
Michel-san, thank you for the elaboration. Maybe this discussion continues because I have failed to express my question clearly. What I'm wondering is whether the following section needs to be protected by ctx->Shared->Mutex or not.

   screen->fence_reference(screen, &fence, so->fence);
   if (!fence) {
      /* If so->fence has been reset to NULL, the fence has been reached,
         but so->b.StatusFlag may not be set to GL_TRUE yet.  Since the
         caller may check the value of the flag as soon as control
         returns, set it here too, although redundant.
      */
      so->b.StatusFlag = GL_TRUE;
      goto quit;
   }

I completely agree that locking is needed in the following section marked with !.
(so->b.StatusFlag = GL_TRUE can be moved out of the block though.)

    if (screen->fence_finish(screen, fence, 0)) {
!      mtx_lock(&ctx->Shared->Mutex);                                            
!      screen->fence_reference(screen, &so->fence, NULL);                        
!      so->b.StatusFlag = GL_TRUE;                                               
!      mtx_unlock(&ctx->Shared->Mutex);                                          
    }
Comment 37 Michel Dänzer 2016-10-20 03:10:46 UTC
Created attachment 127413 [details] [review]
Only protect concurrent access to so->fence

Based on the discussion so far, this patch only protects concurrent access to so->fence. Shinji-san, does this work reliably for your test case?
Comment 38 Suzuki, Shinji 2016-10-20 03:29:09 UTC
As you saw, I'm now running my app without locking on the first block of code but with locking on the second. So far so good. But my application does streaming video processing; I assume the threads run in lock-step fashion after a little perturbation at startup (which comes from initializations here and there?). Therefore I'm afraid that whether my app runs without crashing is not a good assessment of correctness.
Comment 39 Michel Dänzer 2016-10-20 03:34:23 UTC
Well, it seems to be the best test case we have so far, and it's the most relevant one for this bug report. :)
Comment 40 Suzuki, Shinji 2016-10-20 03:48:31 UTC
Sorry, I misunderstood your question and thought you were asking whether my modification works. Your patch should work if mine does, because yours has tighter serialization. (My modification does not acquire the mutex while duplicating so->fence.)

Yours is;
+   mtx_lock(&ctx->Shared->Mutex);
+   screen->fence_reference(screen, &fence, so->fence);
+   mtx_unlock(&ctx->Shared->Mutex);

Mine is;
+   screen->fence_reference(screen, &fence, so->fence);


(In reply to Michel Dänzer from comment #37)
> Created attachment 127413 [details] [review] [review]
> Only protect concurrent access to so->fence
> 
> Based on the discussion so far, this patch only protects concurrent access
> to so->fence. Shinji-san, does this work reliably for your test case?
Comment 41 Suzuki, Shinji 2016-10-21 12:13:46 UTC
I think I now have a better understanding of the problem we are dealing with here.

>Not thread safe (race condition on so->fence):
>  screen->fence_reference(screen, &so->fence, NULL);
>
>Always thread safe (if fence is a local variable):
>  screen->fence_reference(screen, &fence, NULL);

I think the above can be stated more concisely as:
"screen->fence_reference(screen, &fence, NULL)
is thread-safe if the calls are serialized, otherwise it is not."

What's fundamentally wrong with the untouched Mesa code is that screen->fence_reference(screen, &so->fence, NULL) is potentially called more than once. If the calls are serialized, no crash occurs because the second and later calls behave as no-ops. Protecting each call with a mutex is one way to ensure that serial execution, but it is an indirect resolution of the problem. A direct resolution is to not call screen->fence_reference() more than once, because the shared reference contributes only one increment to the reference count. Below is my latest attempt.

static void st_client_wait_sync(struct gl_context *ctx,
				struct gl_sync_object *obj,
				GLbitfield flags, GLuint64 timeout)
{
   struct pipe_screen *screen = st_context(ctx)->pipe->screen;                  
   struct st_sync_object *so = (struct st_sync_object*)obj;                     
   struct pipe_fence_handle *fence = NULL;                                      

   /* Duplicate the reference so that the fence object is guaranteed to
    * be alive at least until the associated 'unref' below is executed.
    * This is important because multiple threads have to execute
    * fence_finish() concurrently even if they target the same fence
    * object, to deal with potentially different time-out settings.
    */
   screen->fence_reference(screen, &fence, so->fence);

   if (fence && screen->fence_finish(screen, fence, timeout)) {
      if (p_atomic_cmpxchg(&so->fence, fence, NULL) == fence) {
         /* Get done with 'so->fence'. This is an 'unref' op.
          * Borrow the value in 'fence' since so->fence is already
          * set to NULL by the cmpxchg above.
          */
         struct pipe_fence_handle *fence_copy = fence;
         screen->fence_reference(screen, &fence_copy, NULL);
      }
   }
   so->b.StatusFlag = GL_TRUE;
   screen->fence_reference(screen, &fence, NULL);
}
Comment 42 Suzuki, Shinji 2016-10-22 03:28:44 UTC
Sorry for the clutter. The timeout case is not handled in the previous attempt.
(The use of the shared completion flag seems to have a potential synchronization problem: a waiter can see a successful wait despite an unsuccessful call to fence_finish() if another thread completes a successful fence_finish() call after the first thread's fence_finish() fails but before it returns.)

static void st_client_wait_sync(struct gl_context *ctx,
                                struct gl_sync_object *obj,
                                GLbitfield flags, GLuint64 timeout)
{
   struct pipe_screen *screen = st_context(ctx)->pipe->screen;
   struct st_sync_object *so = (struct st_sync_object*)obj;
   struct pipe_fence_handle *fence = NULL;

   /* We don't care about GL_SYNC_FLUSH_COMMANDS_BIT, because flush is          
    * already called when creating a fence.                                     
    */

   /* Duplicate the reference so that the fence object is guaranteed to         
    * be alive at least until associated 'unref' below is executed.             
    * This is important because multiple threads have to execute                
    * fence_finish() concurrently even if they target same fence object         
    * in order to deal with potentially different time-out settings.            
    */
   screen->fence_reference(screen, &fence, so->fence);

   if (fence) {
      if(screen->fence_finish(screen, fence, timeout)) {
         so->b.StatusFlag = GL_TRUE;
         if( p_atomic_cmpxchg(&so->fence, fence, NULL) == fence ) {
            /* Get done with 'so->fence'. This is an 'unref' op.
             * Borrow the value in 'fence' since so->fence is already           
             * set to NULL by the cmpxchg above.                                
             */
            struct pipe_fence_handle * fence_copy = fence;
            screen->fence_reference(screen, &fence_copy, NULL);
         }
      }
   } /* fence == NULL means the fence has already been reached */
   screen->fence_reference(screen, &fence, NULL);
}
Comment 43 Michel Dänzer 2016-10-25 07:11:35 UTC
(In reply to Suzuki, Shinji from comment #41)
> "screen->fence_reference(screen, &fence, NULL)
> is thread-safe if the calls are serialized, otherwise it is not."

Not quite.

 screen->fence_reference(screen, &fence, NULL);

is always thread-safe with a local variable fence, because each thread has a separate instance of the local variable, so only one of them will attempt to destroy the fence object referenced by the pointer stored in the local variables.


(In reply to Suzuki, Shinji from comment #42)
> (The use of the shared completion flag seems to have a potential
> synchronization problem: a waiter can see a successful wait despite an
> unsuccessful call to fence_finish() if another thread completes a
> successful fence_finish() call after the first thread's fence_finish()
> fails but before it returns.)

Since the fence signalled before the first waiter returns, I'd say that's at least sort of correct. :)


>    screen->fence_reference(screen, &fence, so->fence);

[...]

>    if (fence) {
>       if(screen->fence_finish(screen, fence, timeout)) {
>          so->b.StatusFlag = GL_TRUE;
>          if( p_atomic_cmpxchg(&so->fence, fence, NULL) == fence ) {

I'm afraid there's still a race here:

Thread 0:
  screen->fence_reference(screen, &fence, so->fence);
  -> struct r600_multi_fence *rsrc = (struct r600_multi_fence *)src;
     -> rsrc != NULL

Thread 1 (meanwhile):
  if (p_atomic_cmpxchg(&so->fence, fence, NULL) == fence) {
  -> fence != NULL, so->fence == NULL
     -> continues to destroy the fence object

=> r600_fence_reference and st_client_wait_sync continue working
   with the fence object memory, which was already freed by thread
   1.


Did you run into any particular problem with my latest patch? Assuming the shared mutex contention is your only issue with it, have you actually measured any significant contention with it? If not, I'd like to submit it as the fix for this bug report. It can always be changed to use a per-fence-object mutex later if somebody measures significant contention with the shared mutex.
Comment 44 Suzuki, Shinji 2016-10-25 08:45:51 UTC
> screen->fence_reference(screen, &fence, NULL);
>
> is always thread-safe with a local variable fence,
That case is thread-safe not because the variable is local but because concurrent execution is not possible because the variable is local. :)

>Since the fence signalled before the first waiter returns, I'd say that's at least sort of correct. :)
I agree that it is 'sort of' correct. But a program that is very particular about whether the fence was reached within the set time-out period will misbehave.

>=> r600_fence_reference and st_client_wait_sync continue working
>   with the fence object memory, which was already freed by thread
>   1.

I see that, since so->fence is duplicated at the beginning of the function, destruction of the fence object would never happen in the block covered by cmpxchg. If it ever happens then that would be by the 'unref' before the function exit.

> Did you run into any particular problem with my latest patch?
I ran my app for a while with your patch minus the locking during reference duplication and observed no crashes. I'm sufficiently confident that yours is correct; my remaining concern is mostly about lock contention, plus the feeling that a solution via serialization of the unref ops blurs the nature of the problem inherent in the stock code. There is also a concern that if the lock is acquired during reference duplication here, someone who studies this part of the code later may think that locking is a necessary operation and do the same, even though reference duplication is, as far as I can see, a lock-free operation. That would further increase contention.
With all that said, I don't mind you applying your patch to master. I believe it is correct. Your decision should take priority, because you are in a position to devote much more time and energy to improving Mesa than I can. In addition, I would not be confident enough in my cmpxchg solution until I wrote tens of unit tests to be run continuously for days and weeks in various environments, which I probably will not have time for.
Comment 45 Suzuki, Shinji 2016-10-25 08:59:52 UTC
> the feeling that a solution via serialization of the unref ops blurs the
> nature of the problem inherent in the stock code.
This concern can perhaps be obviated by inserting an explicit check.

   if (screen->fence_finish(screen, pipe, fence, timeout)) {
      mtx_lock(&ctx->Shared->Mutex);
      if (so->fence != NULL)
        screen->fence_reference(screen, &so->fence, NULL);
      mtx_unlock(&ctx->Shared->Mutex);
      so->b.StatusFlag = GL_TRUE;
   }
Comment 46 Michel Dänzer 2016-10-25 09:16:30 UTC
(In reply to Suzuki, Shinji from comment #44)
> >=> r600_fence_reference and st_client_wait_sync continue working
> >   with the fence object memory, which was already freed by thread
> >   1.
> 
> I see that, since so->fence is duplicated at the beginning of the function,
> destruction of the fence object would never happen in the block covered by
> cmpxchg. If it ever happens then that would be by the 'unref' before the
> function exit.

Right, but there's nothing preventing that from happening before r600_fence_reference in thread 0 calls pipe_reference. Mixing pipe_reference and p_atomic_cmpxchg like this isn't safe.
Comment 47 Marek Olšák 2016-10-25 10:53:12 UTC
What was wrong with the initial mutex idea? I think the solution in comment 22 is sufficient to close this bug.
Comment 48 Marek Olšák 2016-10-25 11:19:23 UTC
I've sent my patches that fix this to mesa-dev.
Comment 49 Suzuki, Shinji 2016-10-25 13:28:19 UTC
(In reply to Marek Olšák from comment #47)
> What was wrong with the initial mutex idea? I think the solution in comment
> 22 is sufficient to close this bug.
I did not like the added storage overhead of embedding a mutex and the CPU overhead of initializing and destroying the mutex even for single-threaded execution. I also wonder whether the lock needs to be held while so->fence is inspected and duplicated. (A per-sync-object mutex poses less contention than ctx->Shared->Mutex, so this point is not worth bothering about.)
However, I can't see anything wrong with the fix.
Comment 50 Marek Olšák 2016-10-26 16:51:02 UTC
Fixed by:

commit b687f766fddb7b39479cd9ee0427984029ea3559
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Tue Oct 25 13:10:49 2016 +0200

    st/mesa: allow multiple concurrent waiters in ClientWaitSync


Feel free to re-open the bug if you encounter these issues again.

