Bug 111763 - ring_gfx hangs/freezes on Navi gpus
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: not set, major
Assignee: Default DRI bug account
Reported: 2019-09-22 12:01 UTC by Marko Popovic
Modified: 2019-10-15 17:10 UTC

Attachments
dmesg output (86.76 KB, text/plain)
2019-09-23 06:56 UTC, Daniel Lu
output of running sudo umr -R gfx_0.0.0 (19.44 MB, text/x-log)
2019-09-23 06:57 UTC, Daniel Lu

Description Marko Popovic 2019-09-22 12:01:08 UTC
I'm opening this as a separate bug to track ring_gfx hangs, so that https://bugs.freedesktop.org/show_bug.cgi?id=111481 can stay focused on sdma0/1 freezes, since those seem to be the ones causing random "out of the blue" hangs on the desktop.

There is another type of freeze/hang that happens when playing Starcraft II via D9VK. It doesn't seem to be related to either NGG or DMA, because I have both disabled with AMD_DEBUG=nodma and AMD_DEBUG=nongg and the hangs occur anyway, in exactly the same place every time.
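
A minimal sketch of how both flags could be handed to a game in a single variable. It assumes AMD_DEBUG accepts a comma-separated list of flags; the helper name `with_amd_debug` and the launch command are hypothetical and not taken from this report.

```python
import os


def with_amd_debug(flags, base_env=None):
    """Return a copy of the environment with the given AMD_DEBUG flags added.

    Assumption: AMD_DEBUG takes a comma-separated flag list.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Keep flags that are already set and skip duplicates.
    merged = [f for f in env.get("AMD_DEBUG", "").split(",") if f]
    for flag in flags:
        if flag not in merged:
            merged.append(flag)
    env["AMD_DEBUG"] = ",".join(merged)
    return env


if __name__ == "__main__":
    env = with_amd_debug(["nodma", "nongg"], base_env={})
    print(env["AMD_DEBUG"])
    # To launch a program with it (the command is a placeholder):
    # subprocess.run(["./game"], env=env)
```

Setting the value once in one place makes it harder for a second assignment to silently overwrite the first.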

Error logs:
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2361623, emitted seq=2361625
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process SC2_x64.exe pid 20236 thread SC2_x64.exe pid 20236
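
For anyone comparing logs across reports, here is a rough sketch that pulls the ring name and sequence numbers out of an amdgpu_job_timedout line. The regular expression is based only on the lines quoted here, and the "pending" field (emitted minus signaled) is my own illustrative name for the jobs that were submitted but never completed.

```python
import re

# Pattern derived from the log lines quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return ring name and fence sequence numbers, or None if no match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return {
        "ring": match.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Fences emitted but not yet signaled when the timeout fired.
        "pending": emitted - signaled,
    }


if __name__ == "__main__":
    sample = (
        "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, "
        "signaled seq=2361623, emitted seq=2361625"
    )
    print(parse_ring_timeout(sample))
```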

I will try to provide trace files captured with RenderDoc for the issues described. The hangs also happen in native Vulkan games such as Rise of the Tomb Raider. I will provide as much info as possible.

Using kernel 5.3, Mesa 19.2 and LLVM 9.
Comment 1 Jeremy Attali 2019-09-23 02:46:21 UTC
Not sure if this will help anyone else, but I found a workaround in my case with DOOM. I was having the same crashes Marko described with Starcraft II, so I tried the following:

- In Steam, I disabled the In Game Steam Overlay
- I switched the Graphics API from OpenGL to Vulkan

I have not had a crash since, but I haven't tried to isolate which of the two changes made the difference.

Packages:
linux 5.3.arch1-1
linux-firmware-agd5f-radeon-navi10 2019.09.13.18.36-1
mesa-git 1:19.3.0_devel.115574.40087ffc5b9-1
vulkan-radeon-git 1:19.3.0_devel.115574.40087ffc5b9-1
libdrm 2.4.99-1
lib32-mesa-git 1:19.3.0_devel.115574.40087ffc5b9-1
lib32-vulkan-radeon-git 1:19.3.0_devel.115574.40087ffc5b9-1
lib32-libdrm 2.4.99-1
Comment 2 Daniel Lu 2019-09-23 06:56:05 UTC
Created attachment 145464 [details]
dmesg output
Comment 3 Daniel Lu 2019-09-23 06:57:45 UTC
Created attachment 145465 [details]
output of running sudo umr -R gfx_0.0.0
Comment 4 Daniel Lu 2019-09-23 07:00:55 UTC
I am seeing a similar hang in Starcraft II. Unlike Marko, I am not using D9VK; I'm using wine-nine instead. The hang doesn't happen in all games but seems to be particularly frequent in the co-op mission "Dead of Night".

Using mesa-git 19.3.0_devel.115092.3f5b541fc8b-1.
Comment 5 Doug Ty 2019-09-30 12:18:12 UTC
I've been getting this too with Minecraft:  
https://bugs.freedesktop.org/show_bug.cgi?id=111669

For my particular case at least, AMD_DEBUG=nodma seems to fix it
Comment 6 Marko Popovic 2019-09-30 15:10:38 UTC
(In reply to Doug Ty from comment #5)
> I've been getting this too with Minecraft:  
> https://bugs.freedesktop.org/show_bug.cgi?id=111669
> 
> For my particular case at least, AMD_DEBUG=nodma seems to fix it

(In reply to Marko Popovic from comment #0)
> There is another type of freeze/hang happening when playing Starcraft II via
> D9VK. This one doesn't seem to be related to either ngg or dma because I
> have them both disabled by AMD_DEBUG=nodma and AMD_DEBUG=nongg and the hangs
> occur anyway, in exactly the same place every time.

You are referring to the sdma0/sdma1 type of hang, which is tracked here: https://bugs.freedesktop.org/show_bug.cgi?id=111481

The ring_gfx hangs are much more reproducible and are not affected by AMD_DEBUG=nodma or AMD_DEBUG=nongg, as I already mentioned above in the bug description.
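
To make the split between the two reports concrete, a small sketch that maps the ring named in a timeout message to the bug tracking it. The bug numbers come from this thread; the assumption that SDMA rings show up in the log as "sdma0" and "sdma1" and the function name `tracking_bug` are mine.

```python
def tracking_bug(ring_name):
    """Map an amdgpu ring name to the bug number tracking that hang type."""
    if ring_name.startswith("sdma"):
        return 111481  # sdma0/sdma1 hangs
    if ring_name.startswith("gfx"):
        return 111763  # ring_gfx hangs (this report)
    return None  # some other ring, not covered by either report


if __name__ == "__main__":
    for ring in ("gfx_0.0.0", "sdma0", "sdma1"):
        print(ring, tracking_bug(ring))
```

Classifying by the ring name in the kernel message, rather than by which workaround helps, avoids guessing the hang type from symptoms.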
Comment 7 Doug Ty 2019-09-30 21:55:56 UTC
(In reply to Marko Popovic from comment #6)
> You are referring to the sdma0/sdma1 type of hang, which is tracked
> here: https://bugs.freedesktop.org/show_bug.cgi?id=111481
>
> The ring_gfx hangs are much more reproducible and are not affected by
> AMD_DEBUG=nodma or AMD_DEBUG=nongg, as I already mentioned above in the
> bug description.

Sorry, but this is incorrect. My Minecraft hang is most definitely a ring gfx hang, *not* sdma. I've posted logs and apitraces in the linked thread if you'd like to check for yourself.

I can't explain why nodma isn't working for you. Perhaps it isn't taking effect for the game? Have you tried putting it in /etc/environment so it's system-wide? I don't know what to tell you regarding nodma, but my hang is definitely ring gfx as well.
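
One way to check whether the variable actually reached the game is to read the process's environment from /proc. This is only a sketch: the function names are hypothetical, and /proc/<pid>/environ reflects the environment the process was started with, so it needs Linux and permission to read that file.

```python
def parse_environ(raw):
    """Parse the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def amd_debug_of(pid):
    """Return the AMD_DEBUG value a process was started with, or None."""
    with open(f"/proc/{pid}/environ", "rb") as handle:
        return parse_environ(handle.read()).get("AMD_DEBUG")


if __name__ == "__main__":
    # Example with synthetic data instead of a real process.
    print(parse_environ(b"HOME=/home/user\0AMD_DEBUG=nodma\0"))
```

The pid to pass is the one printed in the "Process information" line of the kernel log, as long as the process is still alive.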
Comment 8 Marko Popovic 2019-09-30 22:02:23 UTC
(In reply to Doug Ty from comment #7)
> Sorry, but this is incorrect. My Minecraft hang is most definitely a ring
> gfx hang, *not* sdma. I've posted logs and apitraces in the linked thread if
> you'd like to check for yourself.
> 
> I can't explain why nodma isn't working for you. Perhaps it isn't taking
> effect for the game? Have you tried putting it in /etc/environment so it's
> system-wide? I don't know what to tell you regarding nodma, but my hang is
> definitely ring gfx as well.

I guess we just have many different types of hangs, then. The ring_gfx hangs seem more mysterious than the sdma0/1 hangs, since there is no "universal" workaround for them. For me, nodma stops the global sdma-type hangs and nongg stops the Citra-related ring_gfx hang, but neither variable stops the ring_gfx hangs in Starcraft II and RoTR, so it's really confusing.
Comment 9 Marko Popovic 2019-10-03 12:26:44 UTC
https://cgit.freedesktop.org/mesa/mesa/commit/?id=a2a68d551c1c2a4f13761ffa8f3f6f13fee7a384

This might actually fix the ring_gfx hangs, or even the sdma ones, at least for the Vulkan API? I'm not exactly sure, but I will be testing the latest Mesa builds from Oibaf's PPA in the following days and will report back on the issue :)
Comment 10 takios+fdbugs 2019-10-11 13:37:19 UTC
(In reply to Marko Popovic from comment #9)
> https://cgit.freedesktop.org/mesa/mesa/commit/?id=a2a68d551c1c2a4f13761ffa8f3f6f13fee7a384
>
> This might actually fix the ring_gfx hangs, or even the sdma ones, at least
> for the Vulkan API? I'm not exactly sure, but I will be testing the latest
> Mesa builds from Oibaf's PPA in the following days and will report back on
> the issue :)

Sadly, I'm still getting the ring_gfx hangs after a few minutes of playing Trackmania 2.
Comment 11 Marko Popovic 2019-10-11 13:57:17 UTC
(In reply to takios+fdbugs from comment #10)
> Sadly, I'm still getting the ring_gfx hangs after a few minutes of playing
> Trackmania 2.

Oh yes I forgot to add a reply here. It didn't solve any of the hangs for me either.
Comment 12 shahul 2019-10-15 12:58:00 UTC
I am working with a Navi10 RX 5700.
I am hitting the issue below when I run the unigine-heaven benchmark:

[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=5075872, emitted seq=5075874
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process heaven_x64 pid 13723 thread heaven_x64:cs0 pid 13741
[drm] GPU recovery disabled.

Is there any fix for it?

Thanks in advance.
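
Regarding the "GPU recovery disabled" line: the amdgpu kernel module has a gpu_recovery parameter (1 = enable, 0 = disable, -1 = auto). A hedged sketch for checking its current value follows; the sysfs path is the usual location for module parameters, and the function name `describe_gpu_recovery` is hypothetical.

```python
from pathlib import Path

# Usual sysfs location for the amdgpu module parameter.
GPU_RECOVERY_PARAM = Path("/sys/module/amdgpu/parameters/gpu_recovery")


def describe_gpu_recovery(path=GPU_RECOVERY_PARAM):
    """Return 'enabled', 'disabled', 'auto' or 'unknown' for the parameter."""
    try:
        value = int(Path(path).read_text().strip())
    except (OSError, ValueError):
        # File missing (module not loaded) or unexpected contents.
        return "unknown"
    return {1: "enabled", 0: "disabled", -1: "auto"}.get(value, "unknown")


if __name__ == "__main__":
    print(describe_gpu_recovery())
```

This only reports the setting. It says nothing about why the gfx ring timed out in the first place.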
Comment 13 Pierre-Eric Pelloux-Prayer 2019-10-15 17:10:22 UTC
For hangs involving radv, the AMD_DEBUG options aren't relevant.
Use RADV_DEBUG instead (it probably doesn't support the same values).

Also, opening a bug at https://gitlab.freedesktop.org/mesa/mesa/issues is a good idea, since gfx hangs are most likely a driver issue (radv or radeonsi, depending on the API used).
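
As a rough illustration of the point above, a sketch that picks the debug variable from the API a game ends up using. It assumes Vulkan goes through radv (not another Vulkan driver) and OpenGL through radeonsi; the names `DRIVER_BY_API` and `debug_variable_for` are hypothetical, and no RADV_DEBUG values are suggested because they may differ from the AMD_DEBUG ones.

```python
# Assumed mapping: which Mesa driver serves each API on these GPUs.
DRIVER_BY_API = {"opengl": "radeonsi", "vulkan": "radv"}

# Which environment variable each driver reads for debug options.
DEBUG_VAR_BY_DRIVER = {"radeonsi": "AMD_DEBUG", "radv": "RADV_DEBUG"}


def debug_variable_for(api):
    """Return (driver, environment variable) for 'opengl' or 'vulkan'."""
    driver = DRIVER_BY_API[api.lower()]
    return driver, DEBUG_VAR_BY_DRIVER[driver]


if __name__ == "__main__":
    for api in ("OpenGL", "Vulkan"):
        print(api, debug_variable_for(api))
```

Note that translation layers matter here: a Direct3D game run through a Vulkan-based layer ends up on the Vulkan driver, so the OpenGL driver's variable would not apply to it.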

