101294 – radeonsi minecraft forge splash freeze since 17.1

Summary: radeonsi minecraft forge splash freeze since 17.1

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi
Version:	17.1
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

Keywords:	bisected

Reported:	2017-06-04 03:08 UTC by Tobias Auerochs
Modified:	2017-07-12 01:03 UTC
CC List:	3 users

Attachments
Thread dump on Arch Linux mesa 17.1.0 PKGBUILD rebuilt with debug symbols (30.38 KB, text/plain) 2017-06-05 03:45 UTC, Tobias Auerochs
Same freeze in QupZilla (main process) with similar backtraces (29.45 KB, text/plain) 2017-06-05 18:18 UTC, Tobias Auerochs
tentative fix (6.70 KB, patch) 2017-06-19 18:13 UTC, Marek Olšák

Description Tobias Auerochs 2017-06-04 03:08:11 UTC

Launching Minecraft with Forge installed often causes a glSwapBuffers call to never return and to spin on a single CPU core. Vanilla Minecraft is fine, and the freeze only happens with the Forge splash screen enabled.

The freeze always seems to occur right after the texture atlas has been created (visible in the log), and at the same step on the splash screen itself, so it may be an interaction between splash rendering and resource loading.

The rest of the system continues to run fine, and sending SIGKILL is enough to stop the process (other OpenGL-based applications continue to function as expected).

Running on Arch Linux with Mesa 17.1.0 and OpenJDK 8.u121-1, on an AMD Radeon RX480 (8 GB), with a custom-compiled Linux 4.11.3 kernel (based on the linux-zen package with ACS override).

This has happened on older kernel versions as well, and the exact Minecraft and Forge versions do not seem to matter (other than that it must be at least Minecraft 1.7, which is when the splash screen was added).

I have also observed a very similar freeze (spinning on a single core, no system-wide effects) with QupZilla (Qt 5.8, WebEngine); however, this is very rare and I could not get any specific information on it. No other programs seem affected.

Comment 1 Michel Dänzer 2017-06-05 02:36:17 UTC

When it's frozen, attach gdb to the process, run

 thread apply all bt

and attach the output here.

Comment 2 Tobias Auerochs 2017-06-05 03:45:20 UTC

Created attachment 131700
Thread dump on Arch Linux mesa 17.1.0 PKGBUILD rebuilt with debug symbols

Comment 3 Tobias Auerochs 2017-06-05 18:18:00 UTC

Created attachment 131721
Same freeze in QupZilla (main process) with similar backtraces

Comment 4 Michel Dänzer 2017-06-06 01:45:55 UTC

Looks like a deadlock involving struct pb_cache::mutex and struct amdgpu_winsys::bo_fence_lock. One thread locked the former and wants to lock the latter, another thread the other way around.
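
A minimal sketch of this kind of lock-order inversion, written in Python rather than the driver's C. The two locks are hypothetical stand-ins for pb_cache::mutex and amdgpu_winsys::bo_fence_lock, and the thread names, barriers and timeout are assumptions added only so the demonstration terminates and reports the failure instead of hanging as a real mutex would.

```python
import threading

# Hypothetical stand-ins for pb_cache::mutex and amdgpu_winsys::bo_fence_lock.
cache_mutex = threading.Lock()
fence_lock = threading.Lock()


def run_inverted_order(timeout=0.2):
    """Run two threads that take the same two locks in opposite order.

    Returns {thread name: True if that thread obtained its second lock}.
    """
    both_hold_first = threading.Barrier(2)
    both_tried_second = threading.Barrier(2)
    got_second = {}

    def worker(name, first, second):
        with first:
            both_hold_first.wait()  # each thread now holds exactly one lock
            # A plain blocking acquire would wait forever here; the timeout
            # exists only so the sketch can finish and report the result.
            ok = second.acquire(timeout=timeout)
            got_second[name] = ok
            both_tried_second.wait()  # keep `first` held until both have tried
            if ok:
                second.release()

    threads = [
        threading.Thread(target=worker, args=("reclaim", cache_mutex, fence_lock)),
        threading.Thread(target=worker, args=("flush", fence_lock, cache_mutex)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return got_second


if __name__ == "__main__":
    print(run_inverted_order())
```

Neither thread obtains its second lock, because each one is held by the other thread for the whole attempt.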

Comment 5 Fabian Maurer 2017-06-18 15:47:06 UTC

I can confirm the issue.

It doesn't always happen, but it does most of the time, so it's fairly easy to reproduce. It usually happens when you have a lot of mods (I think).

Setting LIBGL_ALWAYS_SOFTWARE works around the issue: the game loads and you can play it just fine.

System the issue was tested on:
- Arch Linux 64bit
- Linux 4.11.6, AMDGPU driver
- Mesa 17.2.0-devel (git-58d337941e) / Mesa 17.1.2
- Radeon R9 285

I think it used to work previously; I'll run some regression tests in the coming days.

Comment 6 Fabian Maurer 2017-06-18 23:54:11 UTC

Bisected to
commit 2769dadb0fafdbafc98630fdf96924a3bb209ab7
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Thu Apr 13 23:46:59 2017 +0200

    gallium/radeon: always flush asynchronously and wait after begin_new_cs
    
    This hides the overhead of everything in the driver after the CS flush and
    before returning from pipe_context::flush.
    Only microbenchmarks will benefit.
    
    +2% FPS for glxgears.
    
    Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>
 

Reverting this commit on the recent git tree fixes the freezes.
Since the bug doesn't always appear, I'm not 100% sure, but I didn't get a single freeze in 20 runs with the revert, while I got 13 freezes in 20 runs without it.


Could you maybe look into this? I have no idea how to fix the issue without throwing away all changes from this commit.

Comment 7 Marek Olšák 2017-06-19 15:12:35 UTC

The bisected commit only uncovers the existing deadlock scenario.

Summary of the issue.

amdgpu_bo_create
-> pb_cache_reclaim_buffer (lock pb_cache::mutex)
-> pb_cache_is_buffer_compat
-> amdgpu_bo_wait (lock bo_fence_lock) - DEADLOCK

pb_reference
-> pb_destroy
-> amdgpu_bo_destroy_or_cache
-> pb_cache_add_buffer (lock pb_cache::mutex) - DEADLOCK

amdgpu_cs_flush (lock bo_fence_lock)
-> amdgpu_add_fence_dependency (loop-wait for submission_in_progress) - DEADLOCK


It looks like the best way to prevent this deadlock is to unify pb_cache::mutex and bo_fence_lock under one lock, that is, one of them has to go.
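
The three call chains can be read as a wait-for cycle. The sketch below (Python, for illustration only) encodes one assumed reading of them: amdgpu_bo_create holds pb_cache::mutex and waits for bo_fence_lock, amdgpu_cs_flush holds bo_fence_lock and waits for a submission to finish, and the thread finishing that submission (called cs_thread here, a hypothetical label) runs the pb_reference chain and waits for pb_cache::mutex. Which thread runs the pb_reference chain is an assumption, not something stated in the summary.

```python
def find_cycle(waits_for):
    """Return one cycle in a {waiter: waited_on} map as a list, or None."""
    for start in waits_for:
        path = [start]
        node = waits_for.get(start)
        # Follow the chain until it ends or revisits a node already on the path.
        while node is not None and node not in path:
            path.append(node)
            node = waits_for.get(node)
        if node is not None:
            return path[path.index(node):]
    return None


# Assumed reading of the three call chains (actor -> actor it is blocked on).
WAITS_FOR = {
    "amdgpu_bo_create": "amdgpu_cs_flush",  # wants bo_fence_lock
    "amdgpu_cs_flush": "cs_thread",         # loop-waits on submission_in_progress
    "cs_thread": "amdgpu_bo_create",        # wants pb_cache::mutex
}

if __name__ == "__main__":
    print(find_cycle(WAITS_FOR))
```

Under this reading, removing any one of the three edges breaks the cycle: `find_cycle` returns `None` once, for example, the `amdgpu_cs_flush` entry is dropped.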

Comment 8 Marek Olšák 2017-06-19 15:15:38 UTC

Actually, I think it's 3 locks we have to unify:
- bo_fence_lock
- pb_cache::mutex
- pb_slabs::mutex

Comment 9 Marek Olšák 2017-06-19 16:21:45 UTC

Hm, I think the proper fix is not to loop on submission_in_progress in amdgpu_cs_flush while bo_fence_lock is held. The looping is there because we need to prepare the request.dependencies list. We can remove that looping rather trivially by passing an amdgpu_fence list (with potential submissions in progress) to the CS thread as dependencies, and the CS thread can initialize request.dependencies accordingly.
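
A rough sketch of that idea, in Python rather than the driver's C and not taken from the actual patch. `Fence`, `cs_flush`, `cs_thread`, the job queue and `demo` are hypothetical names: the flush path only snapshots the fence list while holding the lock, and the worker thread does the waiting and builds the dependency list without holding it.

```python
import queue
import threading


class Fence:
    """Hypothetical stand-in for an amdgpu_fence with a submission in progress."""

    def __init__(self, name):
        self.name = name
        self.submitted = threading.Event()  # set once the submission is done


bo_fence_lock = threading.Lock()


def cs_flush(buffer_fences, jobs):
    # Hold the lock only long enough to copy the fence list; never wait here.
    with bo_fence_lock:
        pending = list(buffer_fences)
    jobs.put(pending)


def cs_thread(jobs, dependencies):
    while True:
        pending = jobs.get()
        if pending is None:  # shutdown marker
            return
        for fence in pending:
            fence.submitted.wait()  # waiting happens without bo_fence_lock
        # Stand-in for initializing request.dependencies.
        dependencies.append([fence.name for fence in pending])


def demo():
    jobs = queue.Queue()
    dependencies = []
    fences = [Fence("f0"), Fence("f1")]
    worker = threading.Thread(target=cs_thread, args=(jobs, dependencies))
    worker.start()
    cs_flush(fences, jobs)
    # The lock is free while the worker is still waiting on the fences.
    lock_was_free = bo_fence_lock.acquire(timeout=1.0)
    if lock_was_free:
        bo_fence_lock.release()
    for fence in fences:
        fence.submitted.set()
    jobs.put(None)
    worker.join()
    return lock_was_free, dependencies


if __name__ == "__main__":
    print(demo())
```

Because no thread blocks while holding bo_fence_lock, other code that needs the lock can still make progress while submissions are pending.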

Comment 10 Marek Olšák 2017-06-19 18:13:43 UTC

Created attachment 132069
tentative fix

Please test the attached patch. Thanks.

Comment 11 Fabian Maurer 2017-06-19 18:57:51 UTC

Thanks for explaining, I already feared that the issue was deeper.

I think I found a configuration for Minecraft that triggers the issue in ~100% of all tries.
I tested the patch, and it reliably fixes the freezes.

Comment 12 Tobias Auerochs 2017-06-19 19:34:31 UTC

I can also confirm that this seems to have fixed the issue for me.

It also applied cleanly on 17.1.0; I know I should update my PKGBUILD...

Comment 13 Marek Olšák 2017-06-20 18:05:40 UTC

Fixed by https://cgit.freedesktop.org/mesa/mesa/commit/?id=58af1f6bb074168669aaec2755c7f369a8b58d62

Comment 14 MirceaKitsune 2017-07-11 22:23:43 UTC

Greetings. I'm also getting a GPU crash / system lockup when running Minecraft as well as other game engines. It doesn't happen during the splash screen, but at random while the game is running. The problem appeared around the date this report was filed, which is also around the time my distribution upgraded to Mesa 17.1.

I assume I'm experiencing the same bug described here, since I honestly don't see Mesa 17.1 introducing two completely unrelated crashes that would cause a renderer as simple as Minecraft to crash the OS. However I updated to Mesa 17.1.4, and surprisingly it still occurs.

Therefore it's possible that this was not fixed. Since it's not my bug and I don't wish to inconvenience folks, I won't set the status to reopened... however the developers might wish to take a look at this. Here is my own report:

https://bugs.freedesktop.org/show_bug.cgi?id=101672

Comment 15 Michel Dänzer 2017-07-12 01:03:12 UTC

(In reply to MirceaKitsune from comment #14)
> I assume I'm experiencing the same bug described here [...]

You're most definitely not. Many different causes can result in similar symptoms. If the symptoms you experience differ from those described in a bug report by even just a detail, it's better to assume you're experiencing a different bug.
