Bug 111763

Summary: ring_gfx hangs/freezes on Navi gpus
Product: DRI Reporter: Marko Popovic <popovic.marko>
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: major    
Priority: medium CC: alexandr.kara, Chryseus8080, danielkinsman.nospam, freedesktop, git, jaapbuurman, julian.labus, robobenklein, tesfabpel
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
dmesg output none
output of running sudo umr -R gfx_0.0.0 none
Journal excerpt vega56 ring gfx timeout, then gpu reset none

Description Marko Popovic 2019-09-22 12:01:08 UTC
I'm opening this report to track ring_gfx related bugs separately, since https://bugs.freedesktop.org/show_bug.cgi?id=111481 should stay focused on sdma0/1 type freezes, which are the ones that seem to cause random "out of the blue" hangs on the desktop.

There is another type of freeze/hang that happens when playing Starcraft II via D9VK. This one doesn't seem to be related to either ngg or dma, because I have both disabled with AMD_DEBUG=nodma and AMD_DEBUG=nongg and the hangs occur anyway, at exactly the same place every time.
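When both options go through the same variable, they have to share a single assignment, because an environment variable holds only one value and a second assignment replaces the first. A later comment in this report uses the comma-separated form AMD_DEBUG="nongg,nodma". A minimal Python sketch of merging flags into one value for a child process; the helper name `with_debug_flags` is hypothetical and not part of Mesa or any other tool:

```python
import os

def with_debug_flags(flags, var="AMD_DEBUG", base_env=None):
    """Return a copy of the environment with `flags` merged into `var`.

    Flags already present are kept, duplicates are dropped, and the
    result is written back as one comma-separated value.
    """
    env = dict(os.environ if base_env is None else base_env)
    merged = [f for f in env.get(var, "").split(",") if f]
    for flag in flags:
        if flag not in merged:
            merged.append(flag)
    env[var] = ",".join(merged)
    return env

# Example: nodma is already set, nongg gets appended.
env = with_debug_flags(["nodma", "nongg"], base_env={"AMD_DEBUG": "nodma"})
print(env["AMD_DEBUG"])  # nodma,nongg
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the game, which keeps the setting local to that process.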

Error logs:
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2361623, emitted seq=2361625
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process SC2_x64.exe pid 20236 thread SC2_x64.exe pid 20236
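
The timeout line carries two fence sequence numbers, and their difference is presumably the number of fences that were emitted but not yet signaled when the job timed out. A minimal Python sketch for extracting those fields, assuming the message format shown in this report; the name `parse_ring_timeout` and the `pending` field are hypothetical:

```python
import re

# Pattern for the amdgpu_job_timedout message as it appears in this report.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeout(line):
    """Return ring name and fence counters, or None if the line does not match."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    return {
        "ring": m.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        "pending": emitted - signaled,  # emitted but not yet signaled
    }

line = ("[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, "
        "signaled seq=2361623, emitted seq=2361625")
print(parse_ring_timeout(line))
```

Lines without sequence numbers, such as a "soft recovered" message, simply return None.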

I will try to provide trace files captured with renderdoc for the described issues. They also happen in native games like Rise of the Tomb Raider and Vulkan etc. I will provide as much info as possible.

Using kernel 5.3, Mesa 19.2 and LLVM 9.
Comment 1 Jeremy Attali 2019-09-23 02:46:21 UTC
Not sure if this might help someone else, but I found a workaround in my case with DOOM. I was having the same crashes that Marko described with Starcraft II, so I tried the following:

- In Steam, I disabled the In Game Steam Overlay
- I switched the Graphics API from OpenGL to Vulkan

I have not had any crashes so far, but I haven't tried to isolate which of the two changes made the difference.

Packages:
linux 5.3.arch1-1
linux-firmware-agd5f-radeon-navi10 2019.09.13.18.36-1
mesa-git 1:19.3.0_devel.115574.40087ffc5b9-1
vulkan-radeon-git 1:19.3.0_devel.115574.40087ffc5b9-1
libdrm 2.4.99-1
lib32-mesa-git 1:19.3.0_devel.115574.40087ffc5b9-1
lib32-vulkan-radeon-git 1:19.3.0_devel.115574.40087ffc5b9-1
lib32-libdrm 2.4.99-1
Comment 2 Daniel Lu 2019-09-23 06:56:05 UTC
Created attachment 145464 [details]
dmesg output
Comment 3 Daniel Lu 2019-09-23 06:57:45 UTC
Created attachment 145465 [details]
output of running sudo umr -R gfx_0.0.0
Comment 4 Daniel Lu 2019-09-23 07:00:55 UTC
I am seeing a similar hang in Starcraft II. Unlike Marko, I am not using d9vk --- instead, I'm using wine-nine. The hang doesn't happen in all games but seems to be particularly frequent in the coop mission "dead of night".

Using mesa-git 19.3.0_devel.115092.3f5b541fc8b-1.
Comment 5 Doug Ty 2019-09-30 12:18:12 UTC
I've been getting this too with Minecraft:  
https://bugs.freedesktop.org/show_bug.cgi?id=111669

For my particular case at least, AMD_DEBUG=nodma seems to fix it
Comment 6 Marko Popovic 2019-09-30 15:10:38 UTC
(In reply to Doug Ty from comment #5)
> I've been getting this too with Minecraft:  
> https://bugs.freedesktop.org/show_bug.cgi?id=111669
> 
> For my particular case at least, AMD_DEBUG=nodma seems to fix it

(In reply to Marko Popovic from comment #0)
> There is another type of freeze/hang happening when playing Starcraft II via
> D9VK. This one doesn't seem to be related to either ngg or dma because I
> have them both disabled by AMD_DEBUG=nodma and AMD_DEBUG=nongg and the hangs
> occur anyway, on exactly the same place every time.

You are referring to the sdma0 / sdma1 type hang, which is tracked here: https://bugs.freedesktop.org/show_bug.cgi?id=111481

ring_gfx hangs are considerably more reproducible and are not affected by AMD_DEBUG=nodma or AMD_DEBUG=nongg, as I already mentioned in the bug description.
Comment 7 Doug Ty 2019-09-30 21:55:56 UTC
(In reply to Marko Popovic from comment #6)
> (In reply to Doug Ty from comment #5)
> > I've been getting this too with Minecraft:  
> > https://bugs.freedesktop.org/show_bug.cgi?id=111669
> > 
> > For my particular case at least, AMD_DEBUG=nodma seems to fix it
> 
> You are refering to sdma0 / sdma1 type hang which is tracked
> here:https://bugs.freedesktop.org/show_bug.cgi?id=111481
> 
> For ring_gfx hangs they're quite more reproducible and are not affected by
> AMD_DEBUG=nodma or AMD_DEBUG=nongg which I already mentioned above in the
> bug description.

Sorry, but this is incorrect. My Minecraft hang is most definitely a ring gfx hang, *not* sdma. I've posted logs and apitraces in the linked thread if you'd like to check for yourself.

I can't explain why nodma isn't working for you; perhaps it doesn't work for that game? Have you tried putting it in /etc/environment so it's system-wide? I don't know what to tell you regarding nodma, but my hang is definitely ring gfx as well.
Comment 8 Marko Popovic 2019-09-30 22:02:23 UTC
(In reply to Doug Ty from comment #7)
> (In reply to Marko Popovic from comment #6)
> > (In reply to Doug Ty from comment #5)
> > > I've been getting this too with Minecraft:  
> > > https://bugs.freedesktop.org/show_bug.cgi?id=111669
> > > 
> > > For my particular case at least, AMD_DEBUG=nodma seems to fix it
> > 
> > You are refering to sdma0 / sdma1 type hang which is tracked
> > here:https://bugs.freedesktop.org/show_bug.cgi?id=111481
> > 
> > For ring_gfx hangs they're quite more reproducible and are not affected by
> > AMD_DEBUG=nodma or AMD_DEBUG=nongg which I already mentioned above in the
> > bug description.
> 
> Sorry, but this is incorrect. My Minecraft hang is most definitely a ring
> gfx hang, *not* sdma. I've posted logs and apitraces in the linked thread if
> you'd like to check for yourself.
> 
> I can't explain why nodma isn't working for you, perhaps it doesn't work for
> game? Have you tried putting it in /etc/environment so it's system-wide? I
> don't know what to tell you regarding nodma, but my hang is definitely ring
> gfx as well.

I guess we just have many different types of hangs then... ring_gfx hangs seem more mysterious than sdma0/1 hangs, since there is no "universal" workaround for them. nodma stops the global sdma-type hangs for me and nongg stops the Citra-related ring_gfx hang, but neither of those two variables stops the Starcraft II and RoTR ring_gfx-type hangs for me, so it's really confusing.
Comment 9 Marko Popovic 2019-10-03 12:26:44 UTC
https://cgit.freedesktop.org/mesa/mesa/commit/?id=a2a68d551c1c2a4f13761ffa8f3f6f13fee7a384

This might actually fix the ring_gfx type hangs, or even the sdma ones, at least for the Vulkan API? I'm not exactly sure, but I will be testing the latest Mesa builds from Oibaf's PPA in the following days and will report back on the issue :)
Comment 10 takios+fdbugs 2019-10-11 13:37:19 UTC
(In reply to Marko Popovic from comment #9)
> https://cgit.freedesktop.org/mesa/mesa/commit/
> ?id=a2a68d551c1c2a4f13761ffa8f3f6f13fee7a384
> 
> This might actually fix the ring_gfx type hangs or even sdma ones at least
> for Vulkan API? Not exactly sure but will also be testing the latest MESA
> builds from Oibaf's PPA in following days and report back on the issue :)

Sadly, I'm still getting the ring_gfx hangs after a few minutes of playing Trackmania 2.
Comment 11 Marko Popovic 2019-10-11 13:57:17 UTC
(In reply to takios+fdbugs from comment #10)
> (In reply to Marko Popovic from comment #9)
> > https://cgit.freedesktop.org/mesa/mesa/commit/
> > ?id=a2a68d551c1c2a4f13761ffa8f3f6f13fee7a384
> > 
> > This might actually fix the ring_gfx type hangs or even sdma ones at least
> > for Vulkan API? Not exactly sure but will also be testing the latest MESA
> > builds from Oibaf's PPA in following days and report back on the issue :)
> 
> Sadly, I'm still getting the ring_gfx hangs after a few minutes of playing
> Trackmania 2.

Oh yes I forgot to add a reply here. It didn't solve any of the hangs for me either.
Comment 12 shahul 2019-10-15 12:58:00 UTC
I am working on a Navi10 RX 5700.
I am facing the issue below when I run the unigine-heaven benchmark:

[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=5075872, emitted seq=5075874
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process heaven_x64 pid 13723 thread heaven_x64:cs0 pid 13741
[drm] GPU recovery disabled.

Is there any fix for it?

Thanks in advance.
Comment 13 Pierre-Eric Pelloux-Prayer 2019-10-15 17:10:22 UTC
For hangs involving radv the AMD_DEBUG options aren't relevant.
You should use RADV_DEBUG instead (probably doesn't support the same values).

Also opening a bug in https://gitlab.freedesktop.org/mesa/mesa/issues is a good idea since gfx hangs are most likely a driver issue (radv or radeonsi, depending on the API used).
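
The distinction above can be sketched as a lookup from driver to debug variable. This is an illustrative Python snippet only: the names `DEBUG_VAR` and `debug_var_for` are hypothetical, and it deliberately lists no option values, since the two variables may not accept the same ones.

```python
# Debug-option variable per Mesa driver, following the comment above.
DEBUG_VAR = {
    "radeonsi": "AMD_DEBUG",   # OpenGL driver
    "radv": "RADV_DEBUG",      # Vulkan driver
}

def debug_var_for(driver):
    """Return the environment variable that carries debug options for `driver`."""
    try:
        return DEBUG_VAR[driver]
    except KeyError:
        raise ValueError("unknown driver: %r" % (driver,)) from None

print(debug_var_for("radv"))
```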
Comment 14 wychuchol 2019-10-31 12:09:11 UTC
RX 5700 XT, Pop!_OS 19.10, latest Oibaf Mesa, not sure which LLVM.
Anomaly 1.5.0 update 3 standalone 64 bit mod for S.T.A.L.K.E.R. Call of Pripyat running under wine d3dx11_43->dxvk (winetricks dxvk d3dcompiler_43 d3dx11_43)

Oct 30 02:49:30 pop-os kernel: [ 4864.627343] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Oct 30 02:49:30 pop-os kernel: [ 4869.231450] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2626284, emitted seq=2626286
Oct 30 02:49:30 pop-os kernel: [ 4869.231486] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process AnomalyDX11.exe pid 5791 thread AnomalyDX11.exe pid 5791
Oct 30 02:49:30 pop-os kernel: [ 4869.231487] [drm] GPU recovery disabled.

It happens at random. Sometimes it hangs straight away, sometimes it can go over an hour without a crash. It is a complete crash, with no option available besides a hard reset. Not even the mouse pointer would move (as with sdma0 hang).

I'm sorry if it's not the right place to report this, I'm somewhat new to all of this.
Comment 15 wychuchol 2019-10-31 12:11:25 UTC
Forgot to add, Kernel v5.4-rc5.
Comment 16 Andrew Sheldon 2019-11-01 01:23:56 UTC
(In reply to wychuchol from comment #14)
> RX 5700 XT Pop OS 19.10 latest Oibaf mesa not sure what llvm
> Anomaly 1.5.0 update 3 standalone 64 bit mod for S.T.A.L.K.E.R. Call of
> Pripyat running under wine d3dx11_43->dxvk (winetricks dxvk d3dcompiler_43
> d3dx11_43)
> 
> Oct 30 02:49:30 pop-os kernel: [ 4864.627343]
> [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for
> fences timed out!
> Oct 30 02:49:30 pop-os kernel: [ 4869.231450] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2626284, emitted
> seq=2626286
> Oct 30 02:49:30 pop-os kernel: [ 4869.231486] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process AnomalyDX11.exe pid 5791
> thread AnomalyDX11.exe pid 5791
> Oct 30 02:49:30 pop-os kernel: [ 4869.231487] [drm] GPU recovery disabled.
> 
> Happens at random. Sometimes hangs straight away, sometimes can go over an
> hour without crash. Complete crash, no option available besides hard reset.
> Not even mouse pointer would move (as with sdma0 hang).
> 
> I'm sorry if it's not the right place to report this, I'm somewhat new to
> all of this.

Ring gfx type hangs tend to be in Mesa. Report here: https://gitlab.freedesktop.org/mesa/mesa/issues

Also I'm not sure how up to date the Oibaf repo is, but Mesa git landed ACO recently for Navi cards. You can try with RADV_PERFTEST=aco environment variable set if your Mesa is new enough, and you might have better luck with hangs.
Comment 17 wychuchol 2019-11-01 16:26:54 UTC
(In reply to Andrew Sheldon from comment #16)
> (In reply to wychuchol from comment #14)
> > RX 5700 XT Pop OS 19.10 latest Oibaf mesa not sure what llvm
> > Anomaly 1.5.0 update 3 standalone 64 bit mod for S.T.A.L.K.E.R. Call of
> > Pripyat running under wine d3dx11_43->dxvk (winetricks dxvk d3dcompiler_43
> > d3dx11_43)
> > 
> > Oct 30 02:49:30 pop-os kernel: [ 4864.627343]
> > [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for
> > fences timed out!
> > Oct 30 02:49:30 pop-os kernel: [ 4869.231450] [drm:amdgpu_job_timedout
> > [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2626284, emitted
> > seq=2626286
> > Oct 30 02:49:30 pop-os kernel: [ 4869.231486] [drm:amdgpu_job_timedout
> > [amdgpu]] *ERROR* Process information: process AnomalyDX11.exe pid 5791
> > thread AnomalyDX11.exe pid 5791
> > Oct 30 02:49:30 pop-os kernel: [ 4869.231487] [drm] GPU recovery disabled.
> > 
> > Happens at random. Sometimes hangs straight away, sometimes can go over an
> > hour without crash. Complete crash, no option available besides hard reset.
> > Not even mouse pointer would move (as with sdma0 hang).
> > 
> > I'm sorry if it's not the right place to report this, I'm somewhat new to
> > all of this.
> 
> Ring gfx type hangs tend to be in Mesa. Report here:
> https://gitlab.freedesktop.org/mesa/mesa/issues
> 
> Also I'm not sure how up to date the Oibaf repo is, but Mesa git landed ACO
> recently for Navi cards. You can try with RADV_PERFTEST=aco environment
> variable set if your Mesa is new enough, and you might have better luck with
> hangs.

Thank you so very much. There is no way to be sure, since the hangs seemed to happen at random, but I think I would normally have experienced at least 2 or 3 hangs in the time I've tested it, and it has been a smooth ride so far. No performance impact either, though running this game the way I do supposedly puts most of the calculations on the CPU, not the GPU.
Comment 18 wychuchol 2019-11-02 12:35:53 UTC
It happened again. This time without a game or anything running, barely logged in and opened a program and boom.

Nov  2 12:42:07 pop-os kernel: [ 1675.883513] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Nov  2 12:42:07 pop-os kernel: [ 1680.747513] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2714, emitted seq=2716
Nov  2 12:42:07 pop-os kernel: [ 1680.747549] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2293 thread Xorg:cs0 pid 2294
Nov  2 12:42:07 pop-os kernel: [ 1680.747551] [drm] GPU recovery disabled.

Only cursor moved, no clicks registered, restart achieved with REISUB.
I tried registering at https://gitlab.freedesktop.org/mesa/mesa/issues but I'm getting no account confirmation mail so can't post it there.
Comment 19 wychuchol 2019-11-02 23:11:59 UTC
This perhaps needs a separate report, but it seems related (it didn't happen before I tried RADV_PERFTEST=aco and AMD_DEBUG="nongg,nodma"), so I'll post it in case someone has had the same issue.

After some time in Witcher 3 GOTY run with Lutris, the PC restarts on its own. I thought something was overheating (I've noticed graphics card memory in PSensor sometimes reaching 90, so I thought maybe that's what's happening), but I investigated kern.log and this always happened before that autonomous reset:

Nov  2 22:01:53 pop-os kernel: [  979.244964] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
Nov  2 22:01:53 pop-os kernel: [  979.244967] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Nov  2 22:01:53 pop-os kernel: [  979.244968] nvme 0000:01:00.0: AER:   device [1987:5012] error status/mask=00001000/00006000
Nov  2 22:01:53 pop-os kernel: [  979.244968] nvme 0000:01:00.0: AER:    [12] Timeout               
Nov  2 22:01:53 pop-os kernel: [  979.262629] Emergency Sync complete

A solution I found is to add pci=nommconf in /etc/default/grub to the line 
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" (so it looks like this: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nommconf").
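
The edit described above can be sketched in Python as a text transformation. This is a minimal illustration, assuming the variable sits on one line with a double-quoted value as in the example; the function name `add_kernel_param` is hypothetical, and the sketch works on a string rather than touching /etc/default/grub itself.

```python
import re

def add_kernel_param(grub_text, param, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Append `param` to the quoted value of `key` unless it is already there."""
    # [ \t]* rather than \s* so the line's trailing newline is left untouched.
    pattern = re.compile(
        r'^(%s=")([^"]*)(")[ \t]*$' % re.escape(key), re.MULTILINE
    )

    def repl(m):
        params = m.group(2).split()
        if param not in params:
            params.append(param)
        return m.group(1) + " ".join(params) + m.group(3)

    return pattern.sub(repl, grub_text)

before = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
print(add_kernel_param(before, "pci=nommconf"), end="")
```

Running the function twice leaves the line unchanged the second time. A change to this file typically takes effect only after the bootloader configuration is regenerated and the machine is rebooted.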
Comment 20 wychuchol 2019-11-04 16:08:28 UTC
I had barely started the PC and opened Pale Moon when I got a hang where only the cursor moved, followed by dozens of graphical artifacts on screen, like square patches of glitches.

Nov  3 13:15:10 pop-os kernel: [  133.998883] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Nov  3 13:15:10 pop-os kernel: [  139.118912] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=11145, emitted seq=11148
Nov  3 13:15:10 pop-os kernel: [  139.118956] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2588 thread gnome-shel:cs0 pid 2606
Nov  3 13:15:10 pop-os kernel: [  139.118958] [drm] GPU recovery disabled.

Then some time later I got a ring gfx related crash with Witcher 3, which hadn't happened before:
Nov  3 14:08:47 pop-os kernel: [ 3185.175837] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Nov  3 14:08:47 pop-os kernel: [ 3190.039750] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=1448573, emitted seq=1448575
Nov  3 14:08:47 pop-os kernel: [ 3190.039786] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process witcher3.exe pid 8100 thread witcher3.exe pid 10168
Nov  3 14:08:47 pop-os kernel: [ 3190.039788] [drm] GPU recovery disabled.
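
Since the hangs here are attributed to different processes (Xorg, gnome-shell, games), it can help to tally them across a kernel log. A small Python sketch, assuming the "Process information" message format shown in this report and process names without spaces; the name `hangs_by_process` is hypothetical:

```python
import re
from collections import Counter

# Matches the second amdgpu_job_timedout line, which names the process.
PROCESS_INFO = re.compile(
    r"Process information: process (?P<process>\S+) pid (?P<pid>\d+) "
    r"thread (?P<thread>\S+) pid (?P<tid>\d+)"
)

def hangs_by_process(log_lines):
    """Count timeout reports per process name."""
    counts = Counter()
    for line in log_lines:
        m = PROCESS_INFO.search(line)
        if m:
            counts[m.group("process")] += 1
    return counts

log = [
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: "
    "process gnome-shell pid 2588 thread gnome-shel:cs0 pid 2606",
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: "
    "process witcher3.exe pid 8100 thread witcher3.exe pid 10168",
    "[drm] GPU recovery disabled.",
]
print(hangs_by_process(log))
```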
Comment 21 Marko Popovic 2019-11-04 16:10:31 UTC
(In reply to wychuchol from comment #20)
> Barely started PC, opened palemoon, curse move only hang and then dozens of
> graphical artifacts on screen like square patches of glitches. 
> 
> Nov  3 13:15:10 pop-os kernel: [  133.998883]
> [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for
> fences timed out!
> Nov  3 13:15:10 pop-os kernel: [  139.118912] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=11145, emitted
> seq=11148
> Nov  3 13:15:10 pop-os kernel: [  139.118956] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process gnome-shell pid 2588 thread
> gnome-shel:cs0 pid 2606
> Nov  3 13:15:10 pop-os kernel: [  139.118958] [drm] GPU recovery disabled.
> 
> Then sometime later I got ring gfx related crash with Witcher 3 which didn't
> happen before:
> Nov  3 14:08:47 pop-os kernel: [ 3185.175837]
> [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for
> fences timed out!
> Nov  3 14:08:47 pop-os kernel: [ 3190.039750] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=1448573, emitted
> seq=1448575
> Nov  3 14:08:47 pop-os kernel: [ 3190.039786] [drm:amdgpu_job_timedout
> [amdgpu]] *ERROR* Process information: process witcher3.exe pid 8100 thread
> witcher3.exe pid 10168
> Nov  3 14:08:47 pop-os kernel: [ 3190.039788] [drm] GPU recovery disabled.

What kernel/MESA combo are you using?
Comment 22 wychuchol 2019-11-04 22:13:38 UTC
(In reply to Marko Popovic from comment #21)
> What kernel/MESA combo are you using?

DRM 3.35.0, 5.4.0-050400rc5-generic, LLVM 9.0.0
Mesa 19.3.0-devel (git-ff6e148 2019-10-29 eoan-oibaf-ppa)

Or at least that's what I got from glxinfo | grep OpenGL

Stalker hung again after just a few minutes of playtime, so I don't know whether any of the fixes actually fixed anything or just held things together a bit more securely.

Nov  4 23:04:16 pop-os kernel: [100672.998576] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Nov  4 23:04:16 pop-os kernel: [100677.862509] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=23742723, emitted seq=23742725
Nov  4 23:04:16 pop-os kernel: [100677.862545] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process AnomalyDX11.exe pid 3904 thread AnomalyDX11.exe pid 3904
Nov  4 23:04:16 pop-os kernel: [100677.862547] [drm] GPU recovery disabled.
Comment 23 wychuchol 2019-11-05 06:07:33 UTC
(In reply to wychuchol from comment #19)
> After some time in Witcher 3 GOTY run with Lutris PC restarts on it's own. I
> thought something is overheating (I've noticed graphic card memory in
> PSensor sometimes reaching 90 so I thought maybe that's what's happening)
> but I investigated kern.log and this always happened before that autonomous
> reset:
> 
> Nov  2 22:01:53 pop-os kernel: [  979.244964] pcieport 0000:00:01.1: AER:
> Corrected error received: 0000:01:00.0
> Nov  2 22:01:53 pop-os kernel: [  979.244967] nvme 0000:01:00.0: AER: PCIe
> Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> Nov  2 22:01:53 pop-os kernel: [  979.244968] nvme 0000:01:00.0: AER:  
> device [1987:5012] error status/mask=00001000/00006000
> Nov  2 22:01:53 pop-os kernel: [  979.244968] nvme 0000:01:00.0: AER:   
> [12] Timeout               
> Nov  2 22:01:53 pop-os kernel: [  979.262629] Emergency Sync complete

The thing with those AER errors is that they can go on and on, and the reset happens a few minutes after the last logged error.
This might be overheating. I managed to find out how to write the sensors readings to a text log and found that memory went up to 96 C (or rather, it stayed there for about 1m 10s).
Last reading before reset:
amdgpu-pci-2800
Adapter: PCI adapter
vddgfx:       +1.16 V  
fan1:        1551 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +74.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +88.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +96.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:      162.00 W  (cap = 195.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +70.5°C  (high = +70.0°C)
Tctl:         +70.5°C  

Now the weird thing is: if this is in fact overheating, why didn't the fan go beyond 1600 rpm even once? The highest was about 1581 rpm, and I don't have the silent BIOS switched on (Sapphire Pulse RX 5700 XT, lever facing away from the video ports).
Comment 24 wychuchol 2019-11-05 16:28:03 UTC
(In reply to wychuchol from comment #23)
> (In reply to wychuchol from comment #19)
> > After some time in Witcher 3 GOTY run with Lutris PC restarts on it's own. I
> > thought something is overheating (I've noticed graphic card memory in
> > PSensor sometimes reaching 90 so I thought maybe that's what's happening)
> > but I investigated kern.log and this always happened before that autonomous
> > reset:
> > 
> > Nov  2 22:01:53 pop-os kernel: [  979.244964] pcieport 0000:00:01.1: AER:
> > Corrected error received: 0000:01:00.0
> > Nov  2 22:01:53 pop-os kernel: [  979.244967] nvme 0000:01:00.0: AER: PCIe
> > Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> > Nov  2 22:01:53 pop-os kernel: [  979.244968] nvme 0000:01:00.0: AER:  
> > device [1987:5012] error status/mask=00001000/00006000
> > Nov  2 22:01:53 pop-os kernel: [  979.244968] nvme 0000:01:00.0: AER:   
> > [12] Timeout               
> > Nov  2 22:01:53 pop-os kernel: [  979.262629] Emergency Sync complete
> 
> Thing with those AER errors is that they can go on and on and reset happens
> few minutes after the last logged error. 
> This might be overheating, I managed to find how to output sensors readings
> into txt log and found that memory went up to 96 C (or rather it stayed
> there for about 1m 10s)
> Last reading before reset:
> amdgpu-pci-2800
> Adapter: PCI adapter
> vddgfx:       +1.16 V  
> fan1:        1551 RPM  (min =    0 RPM, max = 3200 RPM)
> edge:         +74.0°C  (crit = +118.0°C, hyst = -273.1°C)
>                        (emerg = +99.0°C)
> junction:     +88.0°C  (crit = +99.0°C, hyst = -273.1°C)
>                        (emerg = +99.0°C)
> mem:          +96.0°C  (crit = +99.0°C, hyst = -273.1°C)
>                        (emerg = +99.0°C)
> power1:      162.00 W  (cap = 195.00 W)
> 
> k10temp-pci-00c3
> Adapter: PCI adapter
> Tdie:         +70.5°C  (high = +70.0°C)
> Tctl:         +70.5°C  
> 
> Now the weird thing is - if this is in fact overheating why fan didn't go
> beyond 1600 rpm even once.... Highest was like 1581 rpm and I don't have
> silent bios switched on (sapphire pulse rx 5700 xt, lever facing away from
> video ports).

Okay, I don't think it's overheating anymore. I found a moment in Anomaly 1.5.0 that I can't get past without the system resetting, just before a psi storm in Army Warehouses (I can provide a savefile).

Last sensors reading before crash (5 second increments):
amdgpu-pci-2800
Adapter: PCI adapter
vddgfx:       +1.01 V  
fan1:        1560 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +69.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +84.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +80.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:      227.00 W  (cap = 195.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tdie:         +71.8°C  (high = +70.0°C)
Tctl:         +71.8°C
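
Readings like these can be checked against their reported limits mechanically. A minimal Python sketch, assuming the `sensors` text layout pasted above and looking only at the `crit` and `cap` limits; the name `near_limit` and the 5% margin are hypothetical choices:

```python
import re

# "<label>: <value><unit>  (crit|cap = <limit>..." as in the output above.
READING = re.compile(
    r"^(?P<label>\w+):\s+\+?(?P<value>-?\d+(?:\.\d+)?)\s*(?P<unit>°C|W)"
    r"\s+\((?P<kind>crit|cap) = \+?(?P<limit>-?\d+(?:\.\d+)?)"
)

def near_limit(sensors_text, margin=0.05):
    """Return (label, value, limit) for readings within `margin` of their limit."""
    hits = []
    for line in sensors_text.splitlines():
        m = READING.match(line.strip())
        if m is None:
            continue  # fan speed, voltage, emerg-only lines, headers
        value, limit = float(m.group("value")), float(m.group("limit"))
        if value >= limit * (1.0 - margin):
            hits.append((m.group("label"), value, limit))
    return hits

sample = """\
edge:         +69.0°C  (crit = +118.0°C, hyst = -273.1°C)
junction:     +84.0°C  (crit = +99.0°C, hyst = -273.1°C)
mem:          +80.0°C  (crit = +99.0°C, hyst = -273.1°C)
power1:      227.00 W  (cap = 195.00 W)
"""
print(near_limit(sample))
```

On the last reading above, only power1 is flagged (227 W against a 195 W cap). Whether that has anything to do with the resets is not established here.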
Comment 25 Ben Klein 2019-11-09 02:54:53 UTC
Created attachment 145918 [details]
Journal excerpt vega56 ring gfx timeout, then gpu reset

I think I'm having this problem on a Vega 56, I didn't see anyone else mention that card here.

I attached the relevant log, I think it's this same issue, but someone correct me if I'm wrong.

OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.33.0, 5.3.0-20-generic, LLVM 9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.1

Running Pop!_OS:
Linux robo-triangulum 5.3.0-20-generic #21+system76~1572304854~19.10~8caa3e6-Ubuntu SMP Tue Oct 29 00:4 x86_64 x86_64 x86_64 GNU/Linux
Comment 26 Marko Popovic 2019-11-09 12:42:56 UTC
(In reply to Ben Klein from comment #25)
> Created attachment 145918 [details]
> Journal excerpt vega56 ring gfx timeout, then gpu reset
> 
> I think I'm having this problem on a Vega 56, I didn't see anyone else
> mention that card here.
> 
> I attached the relevant log, I think it's this same issue, but someone
> correct me if I'm wrong.
> 
> OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.33.0,
> 5.3.0-20-generic, LLVM 9.0.0)
> OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.1
> 
> Running Pop!_OS:
> Linux robo-triangulum 5.3.0-20-generic
> #21+system76~1572304854~19.10~8caa3e6-Ubuntu SMP Tue Oct 29 00:4 x86_64
> x86_64 x86_64 GNU/Linux

Could be. There are a few patches in the latest RADV, so try out Mesa 20.0 git to see if it fixes anything for you... apparently radv hangs on Navi GPUs stopped with that fix.
Comment 27 James Wood 2019-11-09 20:12:17 UTC
This doesn't seem to be exclusive to Navi GPUs. I've been seeing ring gfx timeouts freeze the system in numerous games, such as Project Zomboid (recently fixed by the developer) and ArmA 3, with the all too familiar dmesg output:
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

I'm using:
Radeon RX 590 Series (POLARIS10, DRM 3.33.0, 5.3.8-arch1-1, LLVM 9.0.0)
Comment 28 Marko Popovic 2019-11-10 12:20:50 UTC
I think this bug report can be closed now. Mesa 20 git basically fixes the radv related ring_gfx hangs. There is still a hang that happens in the Citra emulator (ngg related), but AMD developers are aware of it, so it will probably get fixed too.
Comment 29 Daniel Suarez 2019-11-10 13:50:00 UTC
(In reply to Marko Popovic from comment #28)
> I think this bug report can be closed now, Mesa 20 git basically fixes radv
> related ring_gfx hangs, there is still hang that happens in Citra emulator
> (ngg related) but AMD developers are aware of it so will probably get fixed
> too.

Yeah.. "soon". Still waiting for them to fix bug 111481
Comment 30 Marko Popovic 2019-11-10 13:51:29 UTC
(In reply to Daniel Suarez from comment #29)
> (In reply to Marko Popovic from comment #28)
> > I think this bug report can be closed now, Mesa 20 git basically fixes radv
> > related ring_gfx hangs, there is still hang that happens in Citra emulator
> > (ngg related) but AMD developers are aware of it so will probably get fixed
> > too.
> 
> Yeah.. "soon". Still waiting for them to fix bug 111481

SDMA hangs have nothing to do with ring_gfx hangs, which were mostly radv related and are fixed now.
Comment 31 Daniel Suarez 2019-11-10 13:53:38 UTC
(In reply to Marko Popovic from comment #30)
> (In reply to Daniel Suarez from comment #29)
> > (In reply to Marko Popovic from comment #28)
> > > I think this bug report can be closed now, Mesa 20 git basically fixes radv
> > > related ring_gfx hangs, there is still hang that happens in Citra emulator
> > > (ngg related) but AMD developers are aware of it so will probably get fixed
> > > too.
> > 
> > Yeah.. "soon". Still waiting for them to fix bug 111481
> 
> SDMA hangs have nothing to do with ring_gfx hangs which were mostly radv
> related and are fixed now

Still, I can't even play Vulkan titles reliably because the system constantly hangs even with the workarounds in the bug report. AMD really needs to fix them.
Comment 32 Marko Popovic 2019-11-10 13:55:58 UTC
(In reply to Daniel Suarez from comment #31)
> (In reply to Marko Popovic from comment #30)
> > (In reply to Daniel Suarez from comment #29)
> > > (In reply to Marko Popovic from comment #28)
> > > > I think this bug report can be closed now, Mesa 20 git basically fixes radv
> > > > related ring_gfx hangs, there is still hang that happens in Citra emulator
> > > > (ngg related) but AMD developers are aware of it so will probably get fixed
> > > > too.
> > > 
> > > Yeah.. "soon". Still waiting for them to fix bug 111481
> > 
> > SDMA hangs have nothing to do with ring_gfx hangs which were mostly radv
> > related and are fixed now
> 
> Still, I can't even play Vulkan titles reliably because the system
> constantly hangs even with the workarounds in the bug report. AMD really
> needs to fix them.

Mesa 20.0 should fix Vulkan hangs for you, and with nodma SDMA is disabled fully so you can't get any hangs that are SDMA related.
Comment 33 Daniel Suarez 2019-11-10 13:58:51 UTC
(In reply to Marko Popovic from comment #32)
> (In reply to Daniel Suarez from comment #31)
> > (In reply to Marko Popovic from comment #30)
> > > (In reply to Daniel Suarez from comment #29)
> > > > (In reply to Marko Popovic from comment #28)
> > > > > I think this bug report can be closed now, Mesa 20 git basically fixes radv
> > > > > related ring_gfx hangs, there is still hang that happens in Citra emulator
> > > > > (ngg related) but AMD developers are aware of it so will probably get fixed
> > > > > too.
> > > > 
> > > > Yeah.. "soon". Still waiting for them to fix bug 111481
> > > 
> > > SDMA hangs have nothing to do with ring_gfx hangs which were mostly radv
> > > related and are fixed now
> > 
> > Still, I can't even play Vulkan titles reliably because the system
> > constantly hangs even with the workarounds in the bug report. AMD really
> > needs to fix them.
> 
> Mesa 20.0 should fix Vulkan hangs for you, and with nodma SDMA is disabled
> fully so you can't get any hangs that are SDMA related.

That workaround delays the hangs at best, and I have gotten hangs from OpenGL games and also when using amdvlk.

Don't get me wrong, I'm not saying this bug report shouldn't be closed; I'm just saying that your saying "soon" is very misleading. AMD still hasn't properly fixed bugs that lead to hangs from just watching Firefox, and it's been MONTHS. "Soon" for them is months, apparently.
Comment 34 Marko Popovic 2019-11-10 14:00:27 UTC
(In reply to Daniel Suarez from comment #33)
> (In reply to Marko Popovic from comment #32)
> > (In reply to Daniel Suarez from comment #31)
> > > (In reply to Marko Popovic from comment #30)
> > > > (In reply to Daniel Suarez from comment #29)
> > > > > (In reply to Marko Popovic from comment #28)
> > > > > > I think this bug report can be closed now, Mesa 20 git basically fixes radv
> > > > > > related ring_gfx hangs, there is still hang that happens in Citra emulator
> > > > > > (ngg related) but AMD developers are aware of it so will probably get fixed
> > > > > > too.
> > > > > 
> > > > > Yeah.. "soon". Still waiting for them to fix bug 111481
> > > > 
> > > > SDMA hangs have nothing to do with ring_gfx hangs which were mostly radv
> > > > related and are fixed now
> > > 
> > > Still, I can't even play Vulkan titles reliably because the system
> > > constantly hangs even with the workarounds in the bug report. AMD really
> > > needs to fix them.
> > 
> > Mesa 20.0 should fix Vulkan hangs for you, and with nodma SDMA is disabled
> > fully so you can't get any hangs that are SDMA related.
> 
> That workaround delays the hangs at best, and I have gotten hangs from
> OpenGL games and also by using amdvlk. 
> 
> Don't get me wrong I'm not saying this bug report shouldn't be closed, I'm
> just saying that you saying "soon" is very misleading. AMD hasn't still
> properly fixed bugs that lead to hangs by just watching Firefox, and it's
> been MONTHS. "soon" for them is months apparently

And where exactly did I say soon?
Comment 35 Daniel Suarez 2019-11-10 14:04:32 UTC
(In reply to Marko Popovic from comment #34)
> (In reply to Daniel Suarez from comment #33)
> > (In reply to Marko Popovic from comment #32)
> > > (In reply to Daniel Suarez from comment #31)
> > > > (In reply to Marko Popovic from comment #30)
> > > > > (In reply to Daniel Suarez from comment #29)
> > > > > > (In reply to Marko Popovic from comment #28)
> > > > > > > I think this bug report can be closed now, Mesa 20 git basically fixes radv
> > > > > > > related ring_gfx hangs, there is still hang that happens in Citra emulator
> > > > > > > (ngg related) but AMD developers are aware of it so will probably get fixed
> > > > > > > too.
> > > > > > 
> > > > > > Yeah.. "soon". Still waiting for them to fix bug 111481
> > > > > 
> > > > > SDMA hangs have nothing to do with ring_gfx hangs which were mostly radv
> > > > > related and are fixed now
> > > > 
> > > > Still, I can't even play Vulkan titles reliably because the system
> > > > constantly hangs even with the workarounds in the bug report. AMD really
> > > > needs to fix them.
> > > 
> > > Mesa 20.0 should fix Vulkan hangs for you, and with nodma SDMA is disabled
> > > fully so you can't get any hangs that are SDMA related.
> > 
> > That workaround delays the hangs at best, and I have gotten hangs from
> > OpenGL games and also by using amdvlk. 
> > 
> > Don't get me wrong I'm not saying this bug report shouldn't be closed, I'm
> > just saying that you saying "soon" is very misleading. AMD hasn't still
> > properly fixed bugs that lead to hangs by just watching Firefox, and it's
> > been MONTHS. "soon" for them is months apparently
> 
> And where exactly did I say soon?

My bad, I read "soon" instead of "too". Apologies.
Comment 36 John H 2019-11-12 23:15:42 UTC
Also, for people who have a 5700XT card: check whether yours has a dual BIOS.

Typically one is for running at normal clock speeds, and the other is for running overclocked values.

My card, the Powercolor Red Devil 5700XT, is an example of such a card. In OC mode I have had all sorts of random freezes and crashes in both Windows AND Linux.

Since switching to the default clocks, sometimes called Silent mode, I haven't had a single problem. This is just a heads-up for users who have Navi10-based cards with a selectable BIOS.
Comment 37 Andrew Sheldon 2019-11-13 00:04:37 UTC
(In reply to Daniel Suarez from comment #33)

> That workaround delays the hangs at best, and I have gotten hangs from
> OpenGL games and also by using amdvlk. 
> 

Those hangs shouldn't be SDMA related, however. If you are getting hangs from specific games, report them on the corresponding bug tracker (https://gitlab.freedesktop.org/mesa/mesa for OGL and RADV, https://github.com/GPUOpen-Drivers/AMDVLK/issues for AMDVLK).

I suggest using RADV_PERFTEST=aco with mesa-git for the most stable Vulkan experience (or try the AMDGPU-PRO Vulkan driver). 
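
For anyone who wants to try the RADV_PERFTEST=aco suggestion for a single program without exporting the variable globally, here is a minimal sketch of a launcher. The helper names `radv_aco_env` and `launch` are made up for illustration, and the sketch assumes RADV_PERFTEST holds a comma-separated list of options, so `aco` is appended rather than overwriting whatever is already set.

```python
import os
import subprocess
import sys


def radv_aco_env(base=None):
    """Return a copy of the environment with 'aco' added to RADV_PERFTEST.

    Hypothetical helper: 'base' defaults to the current process environment.
    """
    env = dict(os.environ if base is None else base)
    # Keep any options that are already set and add 'aco' only once.
    options = [opt for opt in env.get("RADV_PERFTEST", "").split(",") if opt]
    if "aco" not in options:
        options.append("aco")
    env["RADV_PERFTEST"] = ",".join(options)
    return env


def launch(command, base=None):
    """Run 'command' (a list of arguments) with ACO enabled; return its exit code."""
    return subprocess.run(command, env=radv_aco_env(base), check=False).returncode


if __name__ == "__main__":
    # Example: python3 launch_aco.py vkcube
    if len(sys.argv) > 1:
        sys.exit(launch(sys.argv[1:]))
```

Only the launched program and its children see the changed variable, so the rest of the desktop session keeps its default shader compiler.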

There's also the "divide error" random hang issue, but it shouldn't be related to SDMA either.
Comment 38 Martin Peres 2019-11-19 09:52:42 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/914.
