Bug 105339 - Deadlock inside glClientWaitSync [Regression bc65dcab3bc48673ff6180afb036561a4b8b1119]
Summary: Deadlock inside glClientWaitSync [Regression bc65dcab3bc48673ff6180afb036561a4...
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: Other Linux (All)
Importance: medium blocker
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Depends on:
Reported: 2018-03-04 23:55 UTC by Matias N. Goldberg
Modified: 2018-06-20 17:24 UTC


Binary test built with Debug & full symbols (14.46 MB, application/x-7z-compressed)
2018-03-05 00:56 UTC, Matias N. Goldberg
Relevant Source Code (1.98 MB, application/x-7z-compressed)
2018-03-05 00:56 UTC, Matias N. Goldberg

Description Matias N. Goldberg 2018-03-04 23:55:28 UTC
Calling glClientWaitSync under specific conditions will run into an unrecoverable deadlock.

The only known workaround is to issue a glFlush before glClientWaitSync.
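For illustration, the workaround amounts to the call order sketched below. This is a minimal sketch, not code from Ogre or Dolphin: `GlSyncCalls` and `waitWithFlushWorkaround` are hypothetical names, and the injected callables stand in for `glFlush` and `glClientWaitSync` so that the snippet builds without an OpenGL context.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Numeric value of GL_WAIT_FAILED from the OpenGL headers.
constexpr unsigned kGlWaitFailed = 0x911D;

// Hypothetical stand-ins for the real entry points, injected so that the
// sketch builds without an OpenGL context:
//   flush      -> glFlush()
//   clientWait -> glClientWaitSync(fence, flags, timeoutNs) for one fixed fence
struct GlSyncCalls {
    std::function<void()> flush;
    std::function<unsigned(unsigned flags, std::uint64_t timeoutNs)> clientWait;
};

// Workaround from this report: flush explicitly, then wait on the fence.
unsigned waitWithFlushWorkaround(const GlSyncCalls& gl, unsigned flags,
                                 std::uint64_t timeoutNs) {
    gl.flush();  // issue the explicit flush before the wait
    const unsigned result = gl.clientWait(flags, timeoutNs);
    assert(result != kGlWaitFailed);
    return result;
}
```

In a real renderer the two callables would simply forward to `glFlush()` and `glClientWaitSync(fence, flags, timeoutNs)`.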

I originally discovered this problem in the Dolphin Emulator, see ticket https://bugs.dolphin-emu.org/issues/10904
However, I am reporting it here now because I was able to reproduce the bug independently.

Reported affected systems so far:
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
16GB RAM (1 stick)
GPU: Radeon RX 560 Series (POLARIS11 / DRM 3.19.0 / 4.14.11, LLVM 6.0.0)
Mesa 18.1.0-devel (git-183ce5e629)
Xubuntu 17.10
Kernel 4.14.11

i3 4150 @ 3.50GHz
Ubuntu 17.10
Kernel 4.15.4
Mesa 18.1.0-devel

I will try to upload a simple repro if I can in the next few hours.
I stumbled on this issue because our Ogre 2.2 sample "Sample_PlanarReflections" is affected by it.

My git version is stuck at 847d0a393d7f0f967f39302900d5330f32b804c8 due to an unrelated regression reported at https://bugs.freedesktop.org/show_bug.cgi?id=105218

However, I know the bug is still present as of 1f5618e81c00199d3349b1ade797382635b2af85 (which is not the latest commit).
Comment 1 Matias N. Goldberg 2018-03-05 00:56:03 UTC
Created attachment 137783
Binary test built with Debug & full symbols
Comment 2 Matias N. Goldberg 2018-03-05 00:56:58 UTC
Created attachment 137784
Relevant Source Code
Comment 3 Matias N. Goldberg 2018-03-05 01:02:03 UTC
I've uploaded a binary with the repro.
Unfortunately, I could not easily reduce the problem to a simpler, minimal test case.

Just download the binary and run Sample_PlanarReflections-2.2.0.
Let me know if you have issues executing the file (e.g. a hardcoded path that slipped through, or a missing library).

Just move around the scene (WASD + mouse). It should hang within the first minute. It often hangs in the first 10 seconds, but it can take up to 2 minutes, at least on my machine.

As for the code, it hangs inside GL3PlusRenderSystem::_endFrame in RenderSystems/GL3Plus/src/OgreGL3PlusRenderSystem.cpp, which purposely creates and waits on a lot of fences to trigger the deadlock.
I included the source code so that the debug symbols work for you.

If anyone wants to build it from source, let me know and I will assist. I'm using Ogre 2.2 at commit f7302ccfa4a9fde3f0e47835924f37db1b3b06b8, but OgreGL3PlusRenderSystem.cpp has been modified to trigger the bug more easily.

Please note that only this sample so far appears to trigger the race condition.
Comment 4 Matias N. Goldberg 2018-03-05 01:08:50 UTC
By the way, if I change the waits to the following:

    waitDuration = 1000000000; // 1 second; glClientWaitSync takes the timeout in nanoseconds
    waitRet = glClientWaitSync( fenceName, waitFlags, waitDuration );
    assert( waitRet != GL_WAIT_FAILED );

Then it still deadlocks. glClientWaitSync returns, but the fence never completes, leaving the while() loop as an infinite loop.
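One way to keep such a loop from spinning forever is to cap the number of timed waits and report the fence as still pending. The following is only a sketch under assumptions, not the code in OgreGL3PlusRenderSystem.cpp: `FenceWait`, `waitFenceBounded`, and `waitOnce` are hypothetical names, and `waitOnce` stands in for a single `glClientWaitSync` call on one fixed fence.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Numeric values of the glClientWaitSync return codes from the OpenGL headers.
constexpr unsigned kGlAlreadySignaled    = 0x911A; // GL_ALREADY_SIGNALED
constexpr unsigned kGlTimeoutExpired     = 0x911B; // GL_TIMEOUT_EXPIRED
constexpr unsigned kGlConditionSatisfied = 0x911C; // GL_CONDITION_SATISFIED
constexpr unsigned kGlWaitFailed         = 0x911D; // GL_WAIT_FAILED

enum class FenceWait { Signaled, StillPending, Failed };

// waitOnce is a hypothetical stand-in for
// glClientWaitSync(fence, flags, sliceNs) on one fixed fence.
FenceWait waitFenceBounded(const std::function<unsigned(std::uint64_t)>& waitOnce,
                           std::uint64_t sliceNs, int maxSlices) {
    assert(maxSlices > 0);
    for (int i = 0; i < maxSlices; ++i) {
        switch (waitOnce(sliceNs)) {
        case kGlAlreadySignaled:
        case kGlConditionSatisfied:
            return FenceWait::Signaled;
        case kGlTimeoutExpired:
            break;  // fence still unsignaled, try another slice
        case kGlWaitFailed:
        default:
            return FenceWait::Failed;
        }
    }
    // Gave up after maxSlices timed waits instead of looping forever.
    return FenceWait::StillPending;
}
```

A caller that receives `FenceWait::StillPending` can log the stuck fence and decide how to recover, which turns the silent hang into a diagnosable condition. It does not fix the underlying driver bug.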

Once it is deadlocked, if I step into si_fence_finish I can see that rfence->tc_token is 0, which means either that it was always 0 or that it has already been zeroed.

I do not know how to continue debugging this race condition as I am not familiar with the code.
Comment 5 Matias N. Goldberg 2018-03-09 21:48:24 UTC
I traced the regression to commit:

commit bc65dcab3bc48673ff6180afb036561a4b8b1119
Author: Nicolai Hähnle <nicolai.haehnle@amd.com>
Date:   Fri Nov 10 10:58:10 2017 +0100

    radeonsi: avoid syncing the driver thread in si_fence_finish
    It is really only required when we need to flush for deferred fences.
    Reviewed-by: Marek Olšák <marek.olsak@amd.com>

That said, I suspect the previous code merely made the race condition much harder to trigger. Before this regression, other games I played in Dolphin occasionally hung in a similar way after 2-4 hours or so of continuous play. That was extremely rare and did not always happen, and it may have been a different bug.
Comment 6 Ben Clapp 2018-04-04 14:09:56 UTC
With a TR 1950X CPU, RX 580 GPU, Debian testing branch (buster), Mesa 18.0, I'm also able to reproduce this bug. (I also discovered it using Dolphin.)
The issue wasn't present in 17.3.7, but when I made the jump to 18.0 it began occurring.
The exact timing of the freeze is a bit inconsistent, but I can get it to happen fairly quickly and consistently.
It seems to be strictly an application freeze rather than a GPU hang: you can kill dolphin-emu and continue using your system without issues or a reboot.
Comment 7 Gregor Münch 2018-05-01 16:24:24 UTC
Added the author of the regression.
Comment 8 Marek Olšák 2018-06-20 17:24:36 UTC
I think this one is fixed by:

commit 7083ac7290a0c37a45494437a45441112f3cc36c
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Tue Apr 24 17:01:35 2018 -0400

    util/u_queue: fix a deadlock in util_queue_finish
    Cc: 18.0 18.1 <mesa-stable@lists.freedesktop.org>
    Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

Feel free to reopen if you encounter the issue again.
