Bug 92757 - GL_ARB_sync objects are not properly synchronized across threads
Summary: GL_ARB_sync objects are not properly synchronized across threads
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Mesa core (show other bugs)
Version: 11.0
Hardware: All All
: medium normal
Assignee: mesa-dev
QA Contact: mesa-dev
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-10-31 19:04 UTC by Steinar H. Gunderson
Modified: 2016-04-11 15:14 UTC (History)
2 users (show)

See Also:
i915 platform:
i915 features:


Attachments
Suggested patch (10.76 KB, patch)
2015-12-01 21:14 UTC, Steinar H. Gunderson
Details | Splinter Review

Description Steinar H. Gunderson 2015-10-31 19:04:18 UTC
Hi,

I have a program which uses multiple threads, with separate but shared contexts. To synchronize data between them, I use fences (from GL_ARB_sync). I've had issues with that occasionally, perhaps every 50000th frame, or so, glWaitSync() would return GL_INVALID_OPERATION (0x501) on a sync object, even though it's supposed to be completely legal.

Looking into src/mesa/main/syncobj.c sheds some light on what the issue might be. It seems many operations are not properly locked at all; in particular, _mesa_validate_sync() is repeatedly called without any locks held, yet looks into a hashtable in ctx->Shared (that is modified by other threads, although those threads hold the lock).

Just adding a lock to _mesa_validate_sync() would seem not to be enough, though. In particular, you have functions like _mesa_ClientWaitSync() which does _mesa_validate_sync(), does some other work and only then calls _mesa_ref_sync_object(); another thread could have deleted the object between validation and ref. And finally, even within the ref/unref pair, there's checking of various state without any locking, which probably needs some thought to verify that it's actually fine if another thread should be modifying the sync object in the meantime.

Perhaps the best thing to do would just be to lock ctx->Shared whenever dealing with a sync object, but one would have to make sure e.g. glClientWaitSync() doesn't sleep while keeping that lock.
Comment 1 Ilia Mirkin 2015-10-31 19:19:48 UTC
After looking at the code briefly, it looks like _mesa_validate_sync really needs to be _mesa_validate_sync_and_get_ref. Otherwise another thread might delete and unref the last RefCount, deleting the object.

Not entirely sure how sharing between multiple threads is done with this, but from the looks of it, there's no refcount acquired there. So each function has to acquire a refcount for the duration of the call.

This is separate from fixing validate_sync to not race against adds/removes to the hash set.
Comment 2 Ian Romanick 2015-11-30 23:04:44 UTC
I think there are quite a few kinds of GL objects that suffer from this problem.  I recall Matt sending out some patches a few months ago that changed some of the locking around objects.  I noticed this pre-existing problem while reviewing his code.
Comment 3 Steinar H. Gunderson 2015-12-01 21:14:13 UTC
Created attachment 120238 [details] [review]
Suggested patch

Hi,

I think the included patch should resolve the problem. I've run with a variant of it for a while without any problems, although I had to go through some conflicts and such to forward-port it against git master, so testing would be appreciated.
Comment 4 Steinar H. Gunderson 2016-04-11 15:14:52 UTC
This was submitted as feb53912f8d8c29594a9fdff914d78bb36d6d56b (part of 11.2.0), but seemingly is missing from 11.1.2. Anyway, I'm closing it as fixed since 11.2.0 is out.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.