Created attachment 143232
dmesg excerpt showing the backtraces and other DRM related entries
Playing Hand of Fate 2 leads to reproducible lockups of my Hawaii PRO GPU, sometimes directly on initial load, sometimes after playing for a while. The system can still be reached over SSH, but the attached input devices are dead (not even Num Lock toggles). In addition, the display powers off: it turns black and behaves as if it were searching for an input, i.e. the connector identifier is shown.
In dmesg I can see "flip_done timed out" errors and two backtraces (see attached dmesg excerpt for all the details):
> [15465.441663] [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:44:crtc-0] flip_done timed out
> [15465.451746] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1164561, emitted seq=1164563
> [15465.451751] [drm] GPU recovery disabled.
> [15467.233739] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=171220, emitted seq=171221
> [15467.233746] [drm] GPU recovery disabled.
> [15475.681643] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:44:crtc-0] flip_done timed out
> [15485.921664] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [PLANE:42:plane-5] flip_done timed out
> [15485.921779] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* amdgpu_dm_commit_planes: acrtc 0, already busy
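For anyone comparing several hangs, the ring timeout lines can be summarised with a small script. The following is only a rough sketch: the function name `pending_fences` and the regular expression are my own, written against the line format quoted above, and other kernel versions may format these messages differently. It reports how far the emitted sequence number is ahead of the signaled one for each ring.

```python
import re

# Matches the amdgpu ring timeout lines as they appear in the excerpt above.
# The exact wording is an assumption based on this log only.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def pending_fences(dmesg_text):
    """Return {ring: emitted - signaled} for every ring timeout line found."""
    result = {}
    for line in dmesg_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            emitted = int(match.group("emitted"))
            signaled = int(match.group("signaled"))
            result[match.group("ring")] = emitted - signaled
    return result


# The two timeout lines from the excerpt in this report.
SAMPLE = """\
[15465.451746] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1164561, emitted seq=1164563
[15467.233739] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=171220, emitted seq=171221
"""

print(pending_fences(SAMPLE))  # {'gfx': 2, 'sdma0': 1}
```

For the excerpt here this gives two unsignaled submissions on the gfx ring and one on sdma0.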
If you need data logged with umr, please provide me with the exact command I should run.
The bug is reproducible with the following stack (Debian testing as a base):
GPU: Hawaii PRO [Radeon R9 290] (ChipID = 0x67b1)
LLVM: SVN:trunk/r351739 (9.0 devel)
Firmware (firmware-amd-graphics): 20190114-1
DDX (xserver-xorg-video-amdgpu): 18.1.0-1
Let me know if you need anything else.
Is this a regression, and if so, can you bisect? Note that it could be a Mesa/LLVM issue rather than a kernel one.
(In reply to Michel Dänzer from comment #1)
> Is this a regression, and if so, can you bisect? Note that it could be a
> Mesa/LLVM issue rather than a kernel one.
Technically yes, but I don't know of a (reasonable) known-good version, because the last time I played this game was in 2017. Between then and now there were many updates for the game (the last one is from 2019-01-11) and an enormous number of commits for the kernel, Mesa and LLVM.
I'm not even sure if I still used the radeon module the last time or if I was already on amdgpu.
Did you try to monitor temperature while running the game?
Did you try to reproduce it with something like glmark2 or FurMark?
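In case it helps, here is a minimal sketch for logging the GPU temperature from another terminal or over SSH while the game runs. It assumes the driver exposes its sensors through the standard hwmon sysfs interface (temp*_input files, in millidegrees Celsius) and that the GPU is card0; the path in `HWMON_ROOT` and the helper name `read_temps_celsius` are assumptions and may need adjusting on your system.

```python
import glob
import os

# Assumed location of the GPU's hwmon directory; the card index may differ.
HWMON_ROOT = "/sys/class/drm/card0/device/hwmon"


def read_temps_celsius(hwmon_root):
    """Return {sensor_path: degrees Celsius} for each temp*_input file found."""
    temps = {}
    pattern = os.path.join(hwmon_root, "hwmon*", "temp*_input")
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as sensor:
                # hwmon reports temperatures in millidegrees Celsius.
                temps[path] = int(sensor.read().strip()) / 1000.0
        except (OSError, ValueError):
            # Skip sensors that cannot be read or return unexpected content.
            continue
    return temps


# Single reading; prints nothing if no sensor is found under HWMON_ROOT.
for sensor_path, temp in read_temps_celsius(HWMON_ROOT).items():
    print(f"{sensor_path}: {temp:.1f} °C")
```

Running it periodically (for example under `watch -n 1`) up to the moment of the hang should show whether the lockup coincides with high temperatures.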
See also bug 109466 comment 9.
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/684.