Summary: | ring_gfx hangs/freezes on Navi GPUs | |
---|---|---|---
Product: | DRI | Reporter: | Marko Popovic <popovic.marko>
Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel>
Status: | RESOLVED MOVED | QA Contact: |
Severity: | major | |
Priority: | medium | CC: | alexandr.kara, Chryseus8080, danielkinsman.nospam, freedesktop, git, jaapbuurman, julian.labus, robobenklein, tesfabpel
Version: | unspecified | |
Hardware: | x86-64 (AMD64) | |
OS: | Linux (All) | |

Description
Marko Popovic
2019-09-22 12:01:08 UTC
Not sure if that might help someone else, but I found a workaround in my case with DOOM. I was having the same crashes as Marko described with Starcraft II, I tried the following:
- In Steam, I disabled the In Game Steam Overlay
- I switched the Graphics API from OpenGL to Vulkan

I did not have any crash so far. But I haven't tried to isolate one or the other.

Packages:
linux 5.3.arch1-1
linux-firmware-agd5f-radeon-navi10 2019.09.13.18.36-1
mesa-git 1:19.3.0_devel.115574.40087ffc5b9-1
vulkan-radeon-git 1:19.3.0_devel.115574.40087ffc5b9-1
libdrm 2.4.99-1
lib32-mesa-git 1:19.3.0_devel.115574.40087ffc5b9-1
lib32-vulkan-radeon-git 1:19.3.0_devel.115574.40087ffc5b9-1
lib32-libdrm 2.4.99-1

Created attachment 145464 [details]
dmesg output
Created attachment 145465 [details]
output of running sudo umr -R gfx_0.0.0
I am seeing a similar hang in Starcraft II. Unlike Marko, I am not using d9vk; instead, I'm using wine-nine. The hang doesn't happen in all games but seems to be particularly frequent in the coop mission "dead of night". Using mesa-git 19.3.0_devel.115092.3f5b541fc8b-1.

I've been getting this too with Minecraft: https://bugs.freedesktop.org/show_bug.cgi?id=111669
For my particular case at least, AMD_DEBUG=nodma seems to fix it.

(In reply to Doug Ty from comment #5)
(In reply to Marko Popovic from comment #0)
> There is another type of freeze/hang happening when playing Starcraft II via
> D9VK. This one doesn't seem to be related to either ngg or dma because I
> have them both disabled by AMD_DEBUG=nodma and AMD_DEBUG=nongg and the hangs
> occur anyway, on exactly the same place every time.

You are referring to the sdma0 / sdma1 type hang, which is tracked here: https://bugs.freedesktop.org/show_bug.cgi?id=111481
ring_gfx hangs are much more reproducible and are not affected by AMD_DEBUG=nodma or AMD_DEBUG=nongg, which I already mentioned above in the bug description.

(In reply to Marko Popovic from comment #6)
Sorry, but this is incorrect. My Minecraft hang is most definitely a ring gfx hang, *not* sdma. I've posted logs and apitraces in the linked thread if you'd like to check for yourself.
I can't explain why nodma isn't working for you; perhaps it doesn't work for the game? Have you tried putting it in /etc/environment so it's system-wide? I don't know what to tell you regarding nodma, but my hang is definitely ring gfx as well.

(In reply to Doug Ty from comment #7)
I guess we just have many different types of hangs then... ring_gfx hangs are more mysterious than sdma0/1 hangs it seems, since there is no "universal" workaround for them. nodma works for stopping global sdma-type hangs for me, nongg works for stopping the Citra-related hang of ring_gfx type, but neither of those two variables stops the Starcraft II and RoTR ring_gfx-type hangs for me, so it's really, really confusing.

https://cgit.freedesktop.org/mesa/mesa/commit/?id=a2a68d551c1c2a4f13761ffa8f3f6f13fee7a384
This might actually fix the ring_gfx type hangs or even the sdma ones, at least for the Vulkan API? Not exactly sure, but I will also be testing the latest Mesa builds from Oibaf's PPA in the following days and report back on the issue :)

(In reply to Marko Popovic from comment #9)
Sadly, I'm still getting the ring_gfx hangs after a few minutes of playing Trackmania 2.

(In reply to takios+fdbugs from comment #10)
Oh yes, I forgot to add a reply here. It didn't solve any of the hangs for me either.

I am working on a Navi10 RX5700 and I am facing the issue below when I run the unigine-heaven benchmark:
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=5075872, emitted seq=5075874
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process heaven_x64 pid 13723 thread heaven_x64:cs0 pid 13741
[drm] GPU recovery disabled.
Is there any fix for it? Thanks in advance.

For hangs involving radv the AMD_DEBUG options aren't relevant. You should use RADV_DEBUG instead (it probably doesn't support the same values). Also, opening a bug at https://gitlab.freedesktop.org/mesa/mesa/issues is a good idea, since gfx hangs are most likely a driver issue (radv or radeonsi, depending on the API used).
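For anyone collecting these reports, here is a minimal Python sketch that pulls the ring name, the two fence sequence numbers and the offending process out of the `amdgpu_job_timedout` lines quoted above. The regular expressions are written against the log format as it appears in this thread only; other kernel versions may word the message differently (for example the "soft recovered" variant carries no sequence numbers), and the function name `parse_ring_timeouts` is made up for this illustration.

```python
import re

# Patterns modelled on the kernel messages quoted in this thread (an assumption:
# other kernel versions may format these lines differently).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(
    r"Process information: process (?P<process>\S+) pid (?P<pid>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return one dict per ring timeout found in the given kernel log text."""
    events = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            events.append({
                "ring": match.group("ring"),
                "signaled": int(match.group("signaled")),
                "emitted": int(match.group("emitted")),
                "process": None,
                "pid": None,
            })
            continue
        # The "Process information" line follows the timeout line it belongs to.
        match = PROCESS_RE.search(line)
        if match and events and events[-1]["process"] is None:
            events[-1]["process"] = match.group("process")
            events[-1]["pid"] = int(match.group("pid"))
    return events


# Log excerpt copied from the unigine-heaven report above.
SAMPLE = """\
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=5075872, emitted seq=5075874
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process heaven_x64 pid 13723 thread heaven_x64:cs0 pid 13741
[drm] GPU recovery disabled.
"""

if __name__ == "__main__":
    for event in parse_ring_timeouts(SAMPLE):
        pending = event["emitted"] - event["signaled"]
        print(event["ring"], event["process"], event["pid"], "pending fences:", pending)
```

Feeding it the output of `dmesg` or an excerpt of kern.log makes it easier to see at a glance which process was on the ring when it timed out, which is the first thing the Mesa developers tend to ask for.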
RX 5700 XT, Pop OS 19.10, latest Oibaf mesa, not sure what llvm
Anomaly 1.5.0 update 3 standalone 64 bit mod for S.T.A.L.K.E.R. Call of Pripyat running under wine d3dx11_43->dxvk (winetricks dxvk d3dcompiler_43 d3dx11_43)

Oct 30 02:49:30 pop-os kernel: [ 4864.627343] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Oct 30 02:49:30 pop-os kernel: [ 4869.231450] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2626284, emitted seq=2626286
Oct 30 02:49:30 pop-os kernel: [ 4869.231486] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process AnomalyDX11.exe pid 5791 thread AnomalyDX11.exe pid 5791
Oct 30 02:49:30 pop-os kernel: [ 4869.231487] [drm] GPU recovery disabled.

Happens at random. Sometimes it hangs straight away, sometimes it can go over an hour without a crash. Complete crash, no option available besides a hard reset. Not even the mouse pointer would move (as with the sdma0 hang).

I'm sorry if it's not the right place to report this, I'm somewhat new to all of this.

Forgot to add, Kernel v5.4-rc5.

(In reply to wychuchol from comment #14)
Ring gfx type hangs tend to be in Mesa. Report here: https://gitlab.freedesktop.org/mesa/mesa/issues

Also I'm not sure how up to date the Oibaf repo is, but Mesa git landed ACO recently for Navi cards. You can try with the RADV_PERFTEST=aco environment variable set if your Mesa is new enough, and you might have better luck with hangs.

(In reply to Andrew Sheldon from comment #16)
Thank you so very much. There is no way to be sure, since the hangs seemed to happen at random, but I would have expected at least 2 or 3 hangs in the time I've tested it, and it has been a smooth ride so far. No performance impact either, though running this game as I do I'm supposedly putting most of the calculations on the CPU, not the GPU.

It happened again. This time without a game or anything running: barely logged in, opened a program, and boom.

Nov 2 12:42:07 pop-os kernel: [ 1675.883513] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Nov 2 12:42:07 pop-os kernel: [ 1680.747513] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2714, emitted seq=2716
Nov 2 12:42:07 pop-os kernel: [ 1680.747549] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2293 thread Xorg:cs0 pid 2294
Nov 2 12:42:07 pop-os kernel: [ 1680.747551] [drm] GPU recovery disabled.

Only the cursor moved, no clicks registered, restart achieved with REISUB. I tried registering at https://gitlab.freedesktop.org/mesa/mesa/issues but I'm getting no account confirmation mail, so I can't post it there.

Perhaps this needs another entry started, but it's related (since it didn't happen before I tried RADV_PERFTEST=aco and AMD_DEBUG="nongg,nodma"), so I'll post it in case someone has had the same issue as me. After some time in Witcher 3 GOTY run with Lutris, the PC restarts on its own. I thought something was overheating (I've noticed graphics card memory in PSensor sometimes reaching 90, so I thought maybe that's what's happening), but I investigated kern.log and this always happened before that autonomous reset:

Nov 2 22:01:53 pop-os kernel: [ 979.244964] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
Nov 2 22:01:53 pop-os kernel: [ 979.244967] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Nov 2 22:01:53 pop-os kernel: [ 979.244968] nvme 0000:01:00.0: AER: device [1987:5012] error status/mask=00001000/00006000
Nov 2 22:01:53 pop-os kernel: [ 979.244968] nvme 0000:01:00.0: AER: [12] Timeout
Nov 2 22:01:53 pop-os kernel: [ 979.262629] Emergency Sync complete

A solution I found is to add pci=nommconf in /etc/default/grub to the line GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" (so it looks like this: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nommconf").

Barely started the PC, opened palemoon, got a hang where only the cursor moves, and then dozens of graphical artifacts on screen like square patches of glitches.

Nov 3 13:15:10 pop-os kernel: [ 133.998883] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Nov 3 13:15:10 pop-os kernel: [ 139.118912] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=11145, emitted seq=11148
Nov 3 13:15:10 pop-os kernel: [ 139.118956] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 2588 thread gnome-shel:cs0 pid 2606
Nov 3 13:15:10 pop-os kernel: [ 139.118958] [drm] GPU recovery disabled.

Then sometime later I got a ring gfx related crash with Witcher 3, which didn't happen before:
Nov 3 14:08:47 pop-os kernel: [ 3185.175837] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Nov 3 14:08:47 pop-os kernel: [ 3190.039750] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=1448573, emitted seq=1448575
Nov 3 14:08:47 pop-os kernel: [ 3190.039786] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process witcher3.exe pid 8100 thread witcher3.exe pid 10168
Nov 3 14:08:47 pop-os kernel: [ 3190.039788] [drm] GPU recovery disabled.

(In reply to wychuchol from comment #20)
What kernel/MESA combo are you using?

(In reply to Marko Popovic from comment #21)
DRM 3.35.0, 5.4.0-050400rc5-generic, LLVM 9.0.0
Mesa 19.3.0-devel (git-ff6e148 2019-10-29 eoan-oibaf-ppa)
Or at least that's what I got from glxinfo | grep OpenGL

Stalker hung again after just a few minutes of playtime, so I don't know whether any of the fixes actually fixed anything or merely held things together a bit longer.

Nov 4 23:04:16 pop-os kernel: [100672.998576] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Nov 4 23:04:16 pop-os kernel: [100677.862509] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=23742723, emitted seq=23742725
Nov 4 23:04:16 pop-os kernel: [100677.862545] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process AnomalyDX11.exe pid 3904 thread AnomalyDX11.exe pid 3904
Nov 4 23:04:16 pop-os kernel: [100677.862547] [drm] GPU recovery disabled.

(In reply to wychuchol from comment #19)
The thing with those AER errors is that they can go on and on, and the reset happens a few minutes after the last logged error.
This might be overheating. I managed to find out how to output sensors readings into a txt log and found that memory went up to 96 C (or rather it stayed there for about 1m 10s).
Last reading before reset:
amdgpu-pci-2800
Adapter: PCI adapter
vddgfx: +1.16 V
fan1: 1551 RPM (min = 0 RPM, max = 3200 RPM)
edge: +74.0°C (crit = +118.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
junction: +88.0°C (crit = +99.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
mem: +96.0°C (crit = +99.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
power1: 162.00 W (cap = 195.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +70.5°C (high = +70.0°C)
Tctl: +70.5°C

Now the weird thing is: if this is in fact overheating, why didn't the fan go beyond 1600 rpm even once? The highest was around 1581 rpm, and I don't have the silent BIOS switched on (Sapphire Pulse RX 5700 XT, lever facing away from the video ports).

(In reply to wychuchol from comment #23)
Okay, I don't think it's overheating anymore. I found a moment in Anomaly 1.5.0 I can't get past without the system resetting, just before a psi storm in Army Warehouses (I can provide a savefile). Last sensors reading before the crash (5 second increments):
amdgpu-pci-2800
Adapter: PCI adapter
vddgfx: +1.01 V
fan1: 1560 RPM (min = 0 RPM, max = 3200 RPM)
edge: +69.0°C (crit = +118.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
junction: +84.0°C (crit = +99.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
mem: +80.0°C (crit = +99.0°C, hyst = -273.1°C)
(emerg = +99.0°C)
power1: 227.00 W (cap = 195.00 W)

k10temp-pci-00c3
Adapter: PCI adapter
Tdie: +71.8°C (high = +70.0°C)
Tctl: +71.8°C

Created attachment 145918 [details]
Journal excerpt vega56 ring gfx timeout, then gpu reset
I think I'm having this problem on a Vega 56, I didn't see anyone else mention that card here.
I attached the relevant log, I think it's this same issue, but someone correct me if I'm wrong.
OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.33.0, 5.3.0-20-generic, LLVM 9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.1
Running Pop!_OS:
Linux robo-triangulum 5.3.0-20-generic #21+system76~1572304854~19.10~8caa3e6-Ubuntu SMP Tue Oct 29 00:4 x86_64 x86_64 x86_64 GNU/Linux
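Since several replies in this thread ask which kernel/Mesa combination is in use, here is a small Python sketch that extracts those fields from `glxinfo | grep OpenGL` output. It assumes the renderer string has the layout shown in the reports here, i.e. `name (CHIP, DRM x, kernel, LLVM y)`; other drivers or Mesa versions may print something different, and the helper name `summarize_glxinfo` is invented for this example.

```python
import re

# Layout assumed from the renderer strings pasted in this thread:
#   OpenGL renderer string: <name> (<chip>, DRM <drm>, <kernel>, LLVM <llvm>)
RENDERER_RE = re.compile(
    r"OpenGL renderer string:\s*(?P<name>.+?)\s*"
    r"\((?P<chip>[^,]+),\s*DRM (?P<drm>[^,]+),\s*(?P<kernel>[^,]+),\s*LLVM (?P<llvm>[^)]+)\)"
)
VERSION_RE = re.compile(r"OpenGL core profile version string:.*Mesa (?P<mesa>\S+)")


def summarize_glxinfo(text):
    """Collect GPU name, chip, DRM, kernel, LLVM and Mesa versions from glxinfo text."""
    info = {}
    match = RENDERER_RE.search(text)
    if match:
        info.update(match.groupdict())
    match = VERSION_RE.search(text)
    if match:
        info["mesa"] = match.group("mesa")
    return info


# The two lines posted in the Vega 56 report above.
SAMPLE = """\
OpenGL renderer string: Radeon RX Vega (VEGA10, DRM 3.33.0, 5.3.0-20-generic, LLVM 9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.1
"""

if __name__ == "__main__":
    for key, value in summarize_glxinfo(SAMPLE).items():
        print(f"{key}: {value}")
```

Pasting the resulting summary alongside the kernel log keeps the version details in one place when a report moves to another tracker.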
(In reply to Ben Klein from comment #25)
Could be. There are a few patches in the latest RADV, so try out Mesa 20.0 git to see if it fixes anything for you... apparently radv hangs for Navi GPUs stopped with that fix.

This doesn't seem to be exclusive to Navi GPUs. I've been having instances of ring gfx timeouts freezing up the system in numerous games such as Project Zomboid (recently fixed by the developer) and ArmA 3, with the all too familiar dmesg:
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
I'm using: Radeon RX 590 Series (POLARIS10, DRM 3.33.0, 5.3.8-arch1-1, LLVM 9.0.0)

I think this bug report can be closed now. Mesa 20 git basically fixes radv related ring_gfx hangs. There is still a hang that happens in the Citra emulator (ngg related), but AMD developers are aware of it, so it will probably get fixed too.

(In reply to Marko Popovic from comment #28)
Yeah.. "soon". Still waiting for them to fix bug 111481

(In reply to Daniel Suarez from comment #29)
SDMA hangs have nothing to do with ring_gfx hangs, which were mostly radv related and are fixed now.

(In reply to Marko Popovic from comment #30)
Still, I can't even play Vulkan titles reliably because the system constantly hangs even with the workarounds in the bug report. AMD really needs to fix them.

(In reply to Daniel Suarez from comment #31)
Mesa 20.0 should fix Vulkan hangs for you, and with nodma SDMA is disabled fully, so you can't get any hangs that are SDMA related.

(In reply to Marko Popovic from comment #32)
That workaround delays the hangs at best, and I have gotten hangs from OpenGL games and also by using amdvlk.

Don't get me wrong, I'm not saying this bug report shouldn't be closed, I'm just saying that you saying "soon" is very misleading. AMD still hasn't properly fixed bugs that lead to hangs from just watching Firefox, and it's been MONTHS. "soon" for them is months, apparently.

(In reply to Daniel Suarez from comment #33)
And where exactly did I say soon?

(In reply to Marko Popovic from comment #34)
My bad, I read "soon" instead of "too", apologies.

Also, for people who have a 5700XT card, check if yours has dual BIOSes. Typically one is for running at normal clock speeds, and the other is for running overclocked values. My card, the Powercolor Red Devil 5700XT, is an example of such a card. In OC mode I have had all sorts of random freezes and crashes in both Windows AND Linux. Since switching to the default clocks, sometimes called Silent mode, I haven't had a single problem. This is just a heads up for users who have Navi10 based cards with a selectable BIOS.

(In reply to Daniel Suarez from comment #33)
> That workaround delays the hangs af best, and I have gotten hangs from
> OpenGl Games and also by using amdvlk.

Those hangs shouldn't be SDMA related, however. If you are getting hangs from specific games, report them on the corresponding bug tracker (https://gitlab.freedesktop.org/mesa/mesa for OGL and RADV, https://github.com/GPUOpen-Drivers/AMDVLK/issues for AMDVLK). I suggest using RADV_PERFTEST=aco with mesa-git for the most stable Vulkan experience (or try the AMDGPU-PRO Vulkan driver). There's also the "divide error" random hang issue, but it shouldn't be related to SDMA either.

-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/914.
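The workarounds discussed in this thread all come down to environment variables (AMD_DEBUG=nodma, AMD_DEBUG=nongg, RADV_PERFTEST=aco). As a rough sketch, the Python snippet below shows one way to apply them to a single program instead of putting them in /etc/environment. The helper `run_with_env` is hypothetical, the child command is only a stand-in for a real game launcher, and whether a given value is honoured depends on the Mesa version and on which driver (radeonsi or radv) the program actually uses.

```python
import os
import subprocess
import sys


def run_with_env(command, extra_env):
    """Run `command` with `extra_env` layered over a copy of the current environment."""
    env = dict(os.environ)
    env.update(extra_env)
    # The parent environment is left untouched; only the child sees the overrides.
    return subprocess.run(command, env=env, capture_output=True, text=True)


if __name__ == "__main__":
    # Stand-in for a game: a child Python process that echoes what it received.
    child = [
        sys.executable,
        "-c",
        "import os; print(os.environ.get('RADV_PERFTEST'), os.environ.get('AMD_DEBUG'))",
    ]
    result = run_with_env(child, {"RADV_PERFTEST": "aco", "AMD_DEBUG": "nodma"})
    print(result.stdout, end="")
```

Scoping the variables to one process makes it easier to test a single workaround at a time, which matters here because several reporters changed more than one setting at once and could not tell which one helped.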