Bug 111481 - AMD Navi GPU frequent freezes on both Manjaro/Ubuntu with kernel 5.3 and mesa 19.2 -git/llvm9
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: highest critical
Assignee: Default DRI bug account
Duplicates: 111759
Reported: 2019-08-25 00:50 UTC by Marko Popovic
Modified: 2019-11-19 09:50 UTC (History)
CC List: 29 users

Attachments
Merge last adg5f code (22.82 MB, patch)
2019-08-31 22:15 UTC, Mathieu Belanger
APITrace log from Citra crash (56.68 MB, application/octet-stream)
2019-09-02 08:25 UTC, Marko Popovic
APITrace log from RocketLeague crash (581.72 KB, application/octet-stream)
2019-09-02 09:13 UTC, Marko Popovic
wip patch (1.01 KB, patch)
2019-09-10 15:23 UTC, Pierre-Eric Pelloux-Prayer
UMR dump of registers on a GPU lockup (187.27 KB, text/plain)
2019-09-10 18:25 UTC, Alexandr Kára
umr output of sdma0/sdma1 after RotTR freeze (20.22 KB, application/gzip)
2019-09-10 21:02 UTC, Sebastian Meyer
Log of divide error (4.46 KB, text/plain)
2019-09-19 20:11 UTC, Matthias Müller
Additional log of divide error (4.27 KB, text/plain)
2019-09-20 06:27 UTC, Doug Ty
dump of the sdma0 ring after a timeout error (92.75 KB, text/plain)
2019-10-06 19:20 UTC, Sebastian Meyer
sdma read delay (1.38 KB, patch)
2019-10-14 10:09 UTC, Pierre-Eric Pelloux-Prayer
APITrace from Rocket League successful launch (343.95 MB, text/plain)
2019-10-17 19:31 UTC, Marko Popovic
umr output after sdma0 timeout (92.08 KB, text/plain)
2019-10-19 05:49 UTC, Sebastian Meyer
sdma0 after apitrace crash (139.45 KB, text/plain)
2019-10-23 17:19 UTC, yamagi
sdma0 after q2 crash (92.08 KB, text/plain)
2019-10-23 17:20 UTC, yamagi
captured GCVM_L2_PROTECTION_FAULT errors in the log. This was captured on 5.4(rc) kernel. (4.11 KB, text/plain)
2019-10-24 13:25 UTC, L.S.S.
Newly captured GCVM_L2_PROTECTION_FAULT errors. This was captured on 5.4(rc) kernel, and with AMD_DEBUG=nodma. (6.28 KB, text/plain)
2019-10-25 13:16 UTC, L.S.S.
Errors captured with amdgpu.gpu_recovery=1 (13.48 KB, text/plain)
2019-10-27 03:10 UTC, L.S.S.
Trace file from Blender SDMA hang (34.89 MB, application/octet-stream)
2019-11-04 20:21 UTC, Marko Popovic
dmesg with gpu recovery enabled (247.49 KB, text/plain)
2019-11-06 19:41 UTC, Marco Liedtke
dmesg of new sdma0 error while watching youtube with firefox, mainline kernel 5.3.9, padoka ppa mesa 19.3 (96.58 KB, text/plain)
2019-11-08 21:57 UTC, Marco Liedtke

Description Marko Popovic 2019-08-25 00:50:43 UTC
I've tried my AMD Radeon RX 5700 XT on both Ubuntu (LLVM 9 / Mesa 19.3 - Oibaf PPA) and Manjaro (LLVM 10 git / mesa-git).
On both I've been using GNOME Shell, and in both cases I had frequent lockups and freezes. Once, my GPU lost its connection to the monitor and stayed that way until I rebooted; other times the desktop would just freeze and crash the whole system.

Software tried: LLVM 10 git / MESA 19.3 - git on Manjaro
                LLVM 9 / MESA 19.3 git from Oibaf PPA
Kernels tried: Manjaro 5.3 RC4, Ubuntu 5.3 RC5 generic, Ubuntu drm-tip 5.3 daily

Error log:
avg 24 22:53:58 Marko-PC kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] ERROR Waiting for fences timed out or interrupted!
avg 24 22:53:58 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, signaled seq=94235, emitted seq=94237
avg 24 22:53:58 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process citra-qt pid 27356 thread citra-qt:cs0 pid 27366

This happened on all setups, and the bug was pretty much the same. Lockups weren't extremely frequent, but frequent enough to be very noticeable (5-6 freezes per day on average).

Faulty hardware can probably be ruled out, since I have never had a hiccup or anything close to a crash or freeze on my Windows desktop.
Comment 1 Marko Popovic 2019-08-25 17:10:52 UTC
Adding error log from Manjaro:
avg 23 16:05:37 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=1742, emitted seq=1743
avg 23 16:05:37 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 975 thread gnome-shell:cs0 pid 988
avg 23 16:05:37 Marko-PC kernel: [drm] GPU recovery disabled.

Pretty much the same type of error happens in different situations, very often at random while using the desktop. Of these two logs, the first is from an OpenGL launch in the Citra emulator, which is reproducible every time; the second, from Manjaro, is from using GNOME Shell, where it crashed without any clear trigger.
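For anyone comparing logs, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of an amdgpu_job_timedout line like the ones above. The function name and the "outstanding" value (emitted minus signaled) are my own illustrative choices, not something the driver reports.

```python
import re

# Pattern for the amdgpu_job_timedout message as it appears in the logs above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, outstanding) or None if no match.

    'outstanding' is simply emitted - signaled; reading it as the number of
    fences that were emitted but never signaled is an assumption.
    """
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return match.group("ring"), signaled, emitted, emitted - signaled


line = (
    "avg 23 16:05:37 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] "
    "*ERROR* ring sdma0 timeout, signaled seq=1742, emitted seq=1743"
)
print(parse_ring_timeout(line))
```

Running the same function over a whole journal makes it easier to see whether a given hang was on gfx_0.0.0 or on one of the sdma rings.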
Comment 2 Mathieu Belanger 2019-08-28 15:39:43 UTC
I confirm that I have this bug or a very similar one.

For some reason, it happens most when I'm using my IDE (IntelliJ-based).
It happens most when I type code, and the crash occurs when the IDE is supposed to propose a code completion.

I get one or two crashes a day.

Video card is RX5700
CPU is Ryzen R7-2700X

Software tested LLVM 9 git
libdrm, mesa, ddx updated from GIT very frequently.

The bug has been there since I got the card, about 3 weeks ago.
Comment 3 Matthias Müller 2019-08-30 22:07:20 UTC
I don't know if I'm encountering the same bug, but it is at least similar.
I don't get hard freezes/lockups, but I get a strange "stuttering", as if the whole OS halted for a few seconds, then continued for a few seconds... The halts grew longer while the usable periods quickly got shorter, to the point of unusability.

It doesn't happen regularly (anything between 30 and 120 minutes, it seems), and I haven't yet identified a direct cause, but in journalctl the same messages seem to appear every time it begins:

kernel: amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xfd6000
kernel: amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xfd6000
 kernel: amdgpu 0000:0f:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0)
 kernel: amdgpu 0000:0f:00.0:   at page 0x0000600000fd6000 from 18
 kernel: amdgpu 0000:0f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00041152

after that there are a lot of these:

kernel: amdgpu: [powerplay] Failed to send message 0x40, response 0xffffffc2 param 0x2
kernel: amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80

until shutdown/hardreset.

Some observations that might help to narrow it down:
The first time it occurred, I had to do a few reboots that showed this behaviour right after startup until it finally worked again, for about 45 minutes.
As it didn't work again after around 10 reboots, I tried uninstalling corectrl (which I used for a custom fan curve), and it finally booted normally again!
I then installed radeon-profile to have fan control (I don't want the fans to stand still on the desktop, as the card gets over 80 °C before the fans kick in).
The issue still occurs with radeon-profile, but at least every reboot runs fine.
Another thing I noticed is that after the first "freeze" with radeon-profile, lm_sensors stopped reporting the fan speed for the card; it always stays at zero.

So maybe it is related to fan-control or the sysfs interface in general?
Comment 4 Matthias Müller 2019-08-30 22:13:50 UTC
Forgot to mention: running Manjaro 5.3rc6.d0826.ga55aa89-1, mesa-git 1:19.3.0_devel.114849.0142dcb990e-1 and llvm-libs-git 10.0.0_r325376.70e158e09e9-1
And if it matters: firmware from https://aur.archlinux.org/packages/linux-firmware-agd5f-radeon-navi10/ v2019.08.26.14.36-1
Comment 5 Mathieu Belanger 2019-08-30 23:20:18 UTC
It probably really depends on what we do on our desktops. I just remembered that I stopped using FileZilla after I got this GPU, because the system crashed almost every time I used it (it never failed to crash while FileZilla was open and running). I still use it for work, but I keep it to a minimum (open, upload, close) instead of keeping it running.
Comment 6 Alexandr Kára 2019-08-31 07:14:11 UTC
Might be related to https://bugs.freedesktop.org/show_bug.cgi?id=111269. I also get the "ring gfx_0.0.0 timeout" error (but not the "ring sdma0 timeout" error). 

Using LLVM from git + Mesa 19.2.0-rc1 on Fedora 30 with kernel from Fedora 31 (5.3.0-0.rc5.git0.1.fc31.x86_64). GPU AMD Radeon RX 5700 XT, CPU AMD Ryzen 7 1700, 32 GB RAM (EDD).
Comment 7 Mathieu Belanger 2019-08-31 22:15:36 UTC
Created attachment 145225
Merge last adg5f code

OK, I looked at the recent kernel patches and commits, and they seem to have fixed a couple of bugs. I do not know if they include these, but I have not crashed once since I merged that into kernel 5.3-rc6 (that code is staged for the 5.4 merge window).

I attached the patch so you can merge it if you wish to try. It adds all the latest bits for AMDGPU to 5.3-rc6, including Renoir support.
Comment 8 Marko Popovic 2019-08-31 22:18:51 UTC
(In reply to Mathieu Belanger from comment #7)
> Created attachment 145225
> Merge last adg5f code
> 
> Ok, I did look at the recent kernel patch and commit and they seam to have
> fixed a couple bugs. I do not know it it include these but I did not crash
> one time since I merged that into the kernel 5.3-rc6. (that code is staged
> for 5.4 merge window).
> 
> I did attach the patch so you can merge that if you wish to try. It add all
> the latest bits for AMDGPU into 5.3-rc6, including Renoir support.

How do I merge the patch myself? :) I'd like to try it
Comment 9 Matthias Müller 2019-08-31 23:50:35 UTC
On my side, I can report that the issue does not occur if I don't use a tool to modify the fans. Do any of you use something like that, or are these separate issues?
Comment 10 Marko Popovic 2019-09-01 00:36:02 UTC
(In reply to Matthias Müller from comment #9)
> On my side i can report that the issue does not occur if i don't use a tool
> to modify the FANs - does anyone of you use something of the like or are
> this seperate issues?

I don't use any tools, all is stock.

(In reply to Mathieu Belanger from comment #7)
> Created attachment 145225
> Merge last adg5f code
> 
> Ok, I did look at the recent kernel patch and commit and they seam to have
> fixed a couple bugs. I do not know it it include these but I did not crash
> one time since I merged that into the kernel 5.3-rc6. (that code is staged
> for 5.4 merge window).
> 
> I did attach the patch so you can merge that if you wish to try. It add all
> the latest bits for AMDGPU into 5.3-rc6, including Renoir support.

After applying the patch, the same type of error occurs. Luckily it is very easy to reproduce with the Citra emulator; apparently it does something that AMD's driver really doesn't like, which makes the error more likely. The error also seems more likely on my end when the CPU is under heavy I/O load.

Last log after applying the latest patch from the merge posted in the attachment:
sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=16312, emitted seq=16314
sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process citra-qt pid 2928 thread citra-qt:cs0 pid 2938
sep 01 02:29:10 Marko-PC kernel: [drm] GPU recovery disabled.

It would be very nice to get an official response from AMD, so that we at least know we're being listened to.
Comment 11 Marko Popovic 2019-09-01 10:24:04 UTC
The same bug is also reproducible when launching the native version of Rocket League.

Here are the logs:
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: [gfxhub] page fault (src_id:0 ring:158 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:   in page starting at address 0x0000000000fff000 from client 27
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00001B3C
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:          MORE_FAULTS: 0x0
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:          WALKER_ERROR: 0x6
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:          PERMISSION_FAULTS: 0x3
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:          MAPPING_ERROR: 0x1
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:          RW: 0x0
sep 01 12:21:12 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=7198, emitted seq=7200
sep 01 12:21:12 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RocketLeague pid 3035 thread RocketLeag:cs0 pid 3042
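When a log only contains the raw status word, the sub-fields can be recovered by hand. The Python sketch below decodes GCVM_L2_PROTECTION_FAULT_STATUS using a bit layout that I inferred from the values printed in this log (0x00001B3C decoding to the five fields above); the layout is an assumption, not taken from register documentation, and other fields of the register are deliberately left out.

```python
# Assumed bit layout, inferred from the decoded values in the log above:
# (field name, bit shift, bit width). Not taken from register documentation.
FIELDS = [
    ("MORE_FAULTS", 0, 1),
    ("WALKER_ERROR", 1, 3),
    ("PERMISSION_FAULTS", 4, 4),
    ("MAPPING_ERROR", 8, 1),
    ("RW", 18, 1),
]


def decode_fault_status(status):
    """Split a fault status word into the named fields listed in FIELDS."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, shift, width in FIELDS
    }


# Status value taken from the Rocket League log above.
for name, value in decode_fault_status(0x00001B3C).items():
    print(f"{name}: {value:#x}")
```

With that layout, the status word from the log reproduces the same field values the kernel printed, which is the only check I have on it.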
Comment 12 Mathieu Belanger 2019-09-01 16:36:37 UTC
I did not crash and have a > 24h uptime.

I could not test Citra, as I don't have a 3DS and the ROMs I found are encrypted.

I could not test Rocket League, as it would require me to pay for a game I will not play.

I will continue to test later today.
Comment 13 Mathieu Belanger 2019-09-02 06:05:20 UTC
(In reply to Marko Popovic from comment #10)
> (In reply to Matthias Müller from comment #9)
> > On my side i can report that the issue does not occur if i don't use a tool
> > to modify the FANs - does anyone of you use something of the like or are
> > this seperate issues?
> 
> I don't use any tools, all is stock.
> 
> (In reply to Mathieu Belanger from comment #7)
> > Created attachment 145225
> > Merge last adg5f code
> > 
> > Ok, I did look at the recent kernel patch and commit and they seam to have
> > fixed a couple bugs. I do not know it it include these but I did not crash
> > one time since I merged that into the kernel 5.3-rc6. (that code is staged
> > for 5.4 merge window).
> > 
> > I did attach the patch so you can merge that if you wish to try. It add all
> > the latest bits for AMDGPU into 5.3-rc6, including Renoir support.
> 
> After applying the patch, same type of error occurs, luckily very easy to
> reproduce with Citra emulator, apparently it does something that AMD's
> driver really doesn't like and makes chances higher for error to occur. Also
> when CPU is under heavy I/O load error seems more likely to occur as well on
> my end.
> 
> Last log after applying the latest patch from the merge posted in the
> attachment:
> sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]]
> *ERROR* Waiting for fences timed out!
> sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> ring gfx_0.0.0 timeout, signaled seq=16312, emitted seq=16314
> sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process citra-qt pid 2928 thread citra-qt:cs0 pid 2938
> sep 01 02:29:10 Marko-PC kernel: [drm] GPU recovery disabled.
> 
> If we could get any official AMD responses to at least make sure that we're
> at least being listened to would be very nice.

I was able to reproduce that Citra crash.
I followed the instructions; it crashed instantly after choosing continue (or a fraction of a second after: the music lagged a little, then the complete system crashed; I was able to sync/umount/reboot with the magic SysRq keys).

Is your crash exactly at the same place? If so, it's very reproducible, and it might be a good idea to run an OpenGL trace to see which commands were sent last to provoke the crash.

I am not familiar with the Ubuntu stuff: were these compiled on your system? If not, do you know the build dates of your Mesa, libdrm and xf86-video-amdgpu (X11 DDX)?

Also, can you tell which dates your microcode files have?

Libdrm : 07:49:10 PM 08/27/2019
Mesa : 05:37:07 PM 08/30/2019
Xorg amdgpu DDX : 07:55:17 PM 08/27/2019

The microcode files were not available in my distribution when I installed them. I downloaded and installed them on August 6, but they were from around July 15, I think. I remember that the latest microcode at that time crashed with a black screen on module load, and that's why I installed an older version.
Comment 14 Marko Popovic 2019-09-02 07:24:16 UTC
(In reply to Mathieu Belanger from comment #13) 
> I was able to reproduce that Citra crash.
> Followed the instruction, it did crash instantly after choosing continue (or
> a fraction of a second after, the music lagged a lil and complete system
> crash (was able so sync/umount/reboot with the magics key)).
> 
> Is your crash exactly at the same place? If so then it's very reproducible
> and  it might be a good idea to run a opengl trace to see what commands was
> sent last to provoke the crash.
> 
> I am not familiar with the Ubuntu stuff, is these got compiled on your
> system? if no do you know the build date of your Mesa, libdrm and
> xf86-video-amdgpu (x11 ddx).
> 
> Also can you tell what microcode files dates you do have?
> 
> Libdrm : 07:49:10 PM 08/27/2019
> Mesa : 05:37:07 PM 08/30/2019
> Xorg amdgpu DDX : 07:55:17 PM 08/27/2019
> 
> The microcode files where not available on my distribution when I installed
> them. I did download/install them on August 6 but they where from July 15
> ish I think, I remember that the latest microcode at that time where
> crashing with a black screen on module load and that's why I did install an
> older version.

Yes, it always happens at the same place with the Citra emulator. What bothers me more about the bug, however, is that it sometimes happens completely randomly on my system, without any obvious trigger, while just browsing and using my desktop. So it's not Citra-exclusive, but luckily I've found the Citra method to provoke the bug, so we can do more detailed logging.

Further observations:
- The bug is the same type as the other crashes and is not exclusive to the Citra emulator; it also happens on launch of Rocket League and sometimes randomly while using the desktop
- The same type of crash IS NOT reproducible on Windows on the same GPU
- The same type of bug IS NOT reproducible on my Intel HD laptop with the same versions of Mesa/LLVM, which probably means either a faulty AMD kernel driver or faulty firmware binaries.

My versions are:
MESA: Mesa 19.3.0-devel (git-6775a52 2019-09-02 eoan-oibaf-ppa)
Kernel: Ubuntu mainline 5.3 daily build (I ALSO tried amd-drm-next-5.4, same bug is reproducable)
Firmware binaries: 2019-08-26 from /~agd5f/radeon_ucode/navi10
Comment 15 Pierre-Eric Pelloux-Prayer 2019-09-02 08:01:39 UTC
(In reply to Marko Popovic from comment #14)
> 
> Yes, always happens at the same place with Citra emulator

Could you capture a trace of the problem (using Apitrace or Renderdoc)?

This would be very helpful to fix it.
Comment 16 Marko Popovic 2019-09-02 08:25:17 UTC
Created attachment 145232
APITrace log from Citra crash
Comment 17 Marko Popovic 2019-09-02 08:26:32 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #15)
> (In reply to Marko Popovic from comment #14)
> > 
> > Yes, always happens at the same place with Citra emulator
> 
> Could you capture a trace of the problem (using Apitrace or Renderdoc)?
> 
> This would be very helpful to fix it.

I added a trace of the reproduced Citra crash, recorded using the command:
apitrace trace ./citra-qt

I hope this is correct, if you need anything else or done differently please just let me know!
Comment 18 Marko Popovic 2019-09-02 09:13:48 UTC
Created attachment 145233
APITrace log from RocketLeague crash

I am adding Rocket League crash output from apitrace.
Comment 19 Pierre-Eric Pelloux-Prayer 2019-09-02 11:53:41 UTC
(In reply to Marko Popovic from comment #17)
> (In reply to Pierre-Eric Pelloux-Prayer from comment #15)
> > (In reply to Marko Popovic from comment #14)
> > > 
> > > Yes, always happens at the same place with Citra emulator
> > 
> > Could you capture a trace of the problem (using Apitrace or Renderdoc)?
> > 
> > This would be very helpful to fix it.
> 
> I added reproduced Citra crash recorded by using command:
> apitrace trace ./citra-qt
> 
> I hope this is correct, if you need anything else or done differently please
> just let me know!

Thanks for the trace!

Replaying the trace a few times is enough to reliably reproduce the hang.

Using AMD_DEBUG=nongg seems to prevent it so it could be a temporary workaround until a proper fix is found.
Could you confirm this on your system?


> 
> I am adding Rocket League crash output from apitrace.

This trace file is very small (only one frame) and doesn't hang here.
Comment 20 Marko Popovic 2019-09-02 12:24:49 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #19)
> Thanks for the trace!
> 
> Replaying the trace a few times is enough to reliably to reproduce the hang.
> 
> Using AMD_DEBUG=nongg seems to prevent it so it could be a temporary
> workaround until a proper fix is found.
> Could you confirm this on your system?
> 
> 
> > 
> > I am adding Rocket League crash output from apitrace.
> 
> This trace file is very small (only one frame) and doesn't hang here.

Thanks for the workaround! Here are my results:

- AMD_DEBUG=nongg fixes the Citra-related crash

- It doesn't fix the Rocket League hang, which seems to be a completely different beast... The GPU hang still happens, but I don't know why; apparently apitrace doesn't provide any useful information about the cause.

Now I will continue testing to see whether the Citra workaround also helps with my random desktop freezes and hangs, and will report back. I added AMD_DEBUG=nongg to my /etc/environment, so it should apply to the desktop as well.
Comment 21 Marko Popovic 2019-09-02 16:45:09 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #19)
> (In reply to Marko Popovic from comment #17)
> > (In reply to Pierre-Eric Pelloux-Prayer from comment #15)
> > > (In reply to Marko Popovic from comment #14)
> > > > 
> > > > Yes, always happens at the same place with Citra emulator
> > > 
> > > Could you capture a trace of the problem (using Apitrace or Renderdoc)?
> > > 
> > > This would be very helpful to fix it.
> > 
> > I added reproduced Citra crash recorded by using command:
> > apitrace trace ./citra-qt
> > 
> > I hope this is correct, if you need anything else or done differently please
> > just let me know!
> 
> Thanks for the trace!
> 
> Replaying the trace a few times is enough to reliably to reproduce the hang.
> 
> Using AMD_DEBUG=nongg seems to prevent it so it could be a temporary
> workaround until a proper fix is found.
> Could you confirm this on your system?
> 
> 
> > 
> > I am adding Rocket League crash output from apitrace.
> 
> This trace file is very small (only one frame) and doesn't hang here.

Okay, I just got another random hang on the desktop, even with the environment variable set the whole time. Unfortunately it seems to be very hard to trace and very random :( It seems that the Citra hang is unrelated to this bug after all; it's a completely different bug. It's good that we discovered another (Citra-related) bug on the way, but we probably can't count that workaround as solving anything, because hangs still occur randomly on the desktop.
Comment 22 Pierre-Eric Pelloux-Prayer 2019-09-02 17:01:52 UTC
> Okay I just got another random hang on the desktop. even with the
> environment variable turned on the whole time. Unfortunately it seems to be
> very hardly tracable seems to be very random :( Seems that Citra hang is
> unrelated to this bug after all, it's a completely different bug. It's good
> that we discovered another (citra-related) bug on the way but probably we
> can't mark that workaround to solve anything because hangs still randomly
> occur on the desktop.

Yes, it's possible that there are different bugs.

For the citra bug: I suspect an issue with Geometry Shaders + NGG but this will require more debugging to confirm (also: using wavesize=64 didn't help, so it's not a regression caused by a0d330bedb9e).

I'm also testing using AMD_DEBUG=nodma system wide to see if it prevents the sdma0 kind of hangs.
Comment 23 Marko Popovic 2019-09-02 17:05:49 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #22)
> > Okay I just got another random hang on the desktop. even with the
> > environment variable turned on the whole time. Unfortunately it seems to be
> > very hardly tracable seems to be very random :( Seems that Citra hang is
> > unrelated to this bug after all, it's a completely different bug. It's good
> > that we discovered another (citra-related) bug on the way but probably we
> > can't mark that workaround to solve anything because hangs still randomly
> > occur on the desktop.
> 
> Yes, it's possible that there are different bugs.
> 
> For the citra bug: I suspect an issue with Geometry Shaders + NGG but this
> will require more debugging to confirm (also: using wavesize=64 didn't help,
> so it's not a regression caused by a0d330bedb9e).
> 
> I'm also testing using AMD_DEBUG=nodma system wide to see if it prevents the
> sdma0 kind of hangs.

Yes, both the Rocket League and desktop hangs seem to be of the sdma0 type. I will add that parameter as well, see if it makes any difference to the Rocket League hang, and use the desktop with both flags enabled.

Well, finding multiple bugs while debugging one can only be a good thing: fewer bugs in the future. My personal computing seems to hit quite a few corner cases that otherwise go unnoticed :D which should benefit many happy new Navi users.
Comment 24 Marko Popovic 2019-09-02 17:16:13 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #22)
> Yes, it's possible that there are different bugs.
> 
> For the citra bug: I suspect an issue with Geometry Shaders + NGG but this
> will require more debugging to confirm (also: using wavesize=64 didn't help,
> so it's not a regression caused by a0d330bedb9e).
> 
> I'm also testing using AMD_DEBUG=nodma system wide to see if it prevents the
> sdma0 kind of hangs.

OK, I confirm that AMD_DEBUG=nodma gets rid of the Rocket League startup crash. I will report on desktop stability for the rest of the day!
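For testing one program at a time instead of editing /etc/environment, the two workaround flags can be set per process. This is a minimal Python sketch under the assumption that Mesa accepts a comma-separated flag list in AMD_DEBUG; the helper name with_amd_debug is hypothetical and the program path in the trailing comment is only a placeholder.

```python
import os


def with_amd_debug(flags, base_env=None):
    """Return a copy of the environment with the given AMD_DEBUG flags added.

    Flags already present in AMD_DEBUG are kept and not duplicated.
    Assumes a comma-separated list is accepted by the driver.
    """
    env = dict(os.environ if base_env is None else base_env)
    current = [flag for flag in env.get("AMD_DEBUG", "").split(",") if flag]
    for flag in flags:
        if flag not in current:
            current.append(flag)
    env["AMD_DEBUG"] = ",".join(current)
    return env


# Build an environment with both workarounds from this thread enabled.
env = with_amd_debug(["nongg", "nodma"], base_env={"PATH": "/usr/bin"})
print(env["AMD_DEBUG"])

# To launch a single program with it (placeholder path):
#   import subprocess
#   subprocess.run(["./citra-qt"], env=with_amd_debug(["nongg"]))
```

Setting the variable per process keeps the rest of the desktop on the default code path, which helps when checking whether a flag fixes one application or the random desktop hangs.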
Comment 25 Mathieu Belanger 2019-09-03 14:56:26 UTC
I confirm that a system-wide nongg does not fix the random surprise crashes I get with FileZilla and PhpStorm.

Switching to system-wide nodma (that sounds scary on the performance side).
Comment 26 Marko Popovic 2019-09-04 12:20:23 UTC
(In reply to Mathieu Belanger from comment #25)
> I confirm that a system wide nongg do not fix random surprise crash I get on
> filezilla and phpstorm.
> 
> Switching to system wide nodma (that sound scary on the performance side)

Yes, but unfortunately that is exactly what "solved" the sdma0 freezes for me. Let's hope that a proper fix comes as soon as possible!
Comment 27 Mathieu Belanger 2019-09-04 12:24:44 UTC
It did fix it for me too.
Comment 28 Pierre-Eric Pelloux-Prayer 2019-09-04 15:36:06 UTC
Regarding sdma ring hangs: if you still have access to the affected machine using ssh, it would be helpful to add a comment with the following information:

  - the last dmesg lines (at least the "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=9871, emitted seq=9873" one)
  - the output of : umr -R sdma0 (or sdma1 depending on which one failed)

Thanks!
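The two steps above could be sketched roughly as follows in Python: scan the saved dmesg text for SDMA ring timeouts and build the matching umr command for each ring that failed. Only the `umr -R <ring>` form comes from this comment; the function name and the idea of working from saved dmesg text are illustrative assumptions.

```python
import re

# Matches the ring name in "ring sdma0 timeout" / "ring sdma1 timeout".
SDMA_TIMEOUT_RE = re.compile(r"ring (sdma\d+) timeout")


def umr_commands(dmesg_text):
    """List one 'umr -R <ring>' command per SDMA ring that timed out."""
    rings = []
    for ring in SDMA_TIMEOUT_RE.findall(dmesg_text):
        if ring not in rings:  # keep first-seen order, no duplicates
            rings.append(ring)
    return [["umr", "-R", ring] for ring in rings]


sample = (
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, "
    "signaled seq=9871, emitted seq=9873\n"
)
for command in umr_commands(sample):
    print(" ".join(command))
```

The command lists can be passed to subprocess.run over the ssh session if umr is installed; gfx ring timeouts are ignored here on purpose, since the request is about the sdma rings.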
Comment 29 Marko Popovic 2019-09-05 11:14:41 UTC
(In reply to Mathieu Belanger from comment #27)
> It did fix it for me too.

(In reply to Pierre-Eric Pelloux-Prayer from comment #28)
> Regarding sdma ring hangs: if you still have access to the affected machine
> using ssh, it would be helpful to add a comment with the following
> information:
> 
>   - the last dmesg lines (at least the "[drm:amdgpu_job_timedout [amdgpu]]
> *ERROR* ring sdma1 timeout, signaled seq=9871, emitted seq=9873" one)
>   - the output of : umr -R sdma0 (or sdma1 depending on which one failed)
> 
> Thanks!

Mathieu could you assist Pierre-Eric with this? 
I am currently on vacation and won't be able to debug or test further until 15th of September.
Comment 30 Mathieu Belanger 2019-09-05 11:50:19 UTC
I will disable the workaround Friday after work.

Then I will report when it crashes.
Comment 31 Mathieu Belanger 2019-09-06 01:58:37 UTC
Is that patch set https://lists.freedesktop.org/archives/amd-gfx/2019-September/039593.html related to this?

Graceful page fault handling for Vega/Navi
Comment 32 Sebastian Meyer 2019-09-10 14:19:25 UTC
Having the same issues with my new Powercolor RX 5700 XT on Arch Linux.
The system freezes after a couple of seconds when I try to run games like RotTR. Other games I've tested, like Dota 2 for example, are unreliable and make the system freeze after a few minutes or after an hour or so.

The dmesg output when SSHing into my system:
[65070.475185] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[65070.475259] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[65075.595093] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[65075.595180] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[65075.595260] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=6662176, emitted seq=6662178
[65075.595322] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RiseOfTheTombRa pid 56804 thread RiseOfTheT:cs0 pid 56811
[65075.595324] [drm] GPU recovery disabled.

I've also had a couple of sdma0/sdma1 related freezes after opening resource-heavy websites in Chromium. Unfortunately though, I'm unable to reproduce it now. If the system freezes again, I will provide logs and umr output, as requested. The website which caused most of the freezes was izurvive.com (interactive DayZ map) and it froze while toggling map markers on and off.
Sep 08 17:49:52 basti-pc kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 08 17:49:57 basti-pc kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 08 17:49:57 basti-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=2372, emitted seq=2375
Sep 08 17:49:57 basti-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chromium pid 1271 thread chromium:cs0 pid 1331

$ pacman -Q linux-mainline linux-firmware-agd5f-radeon-navi10 {,lib32-}{mesa-git,vulkan-radeon-git,llvm-git,libdrm-git}
linux-mainline 5.3rc8-1
linux-firmware-agd5f-radeon-navi10 2019.08.26.14.36-1
mesa-git 1:19.3.0_devel.115190.f83f9d7daa0-1
lib32-mesa-git 1:19.3.0_devel.115190.f83f9d7daa0-1
vulkan-radeon-git 1:19.3.0_devel.115190.f83f9d7daa0-1
lib32-vulkan-radeon-git 1:19.3.0_devel.115190.f83f9d7daa0-1
llvm-git 10.0.0_r326348.d7d8bb937ad-1
lib32-llvm-git 10.0.0_r326355.d065c811649-1
libdrm-git 2.4.99.r17.g10cd9c3d-1
lib32-libdrm-git 2.4.99.r17.g10cd9c3d-1
Comment 33 Pierre-Eric Pelloux-Prayer 2019-09-10 15:23:51 UTC
Created attachment 145323
wip patch

You can try the attached kernel patch, which will hopefully prevent some sdma timeouts.

I'm still testing it, but the more testers the better :)
Comment 34 Mathieu Belanger 2019-09-10 15:36:52 UTC
Patch applied

Removed nodma from the /etc/environment

Will reboot at lunch time. Usually my IDEs trigger the crash. Will see how it goes.
Comment 35 Alexandr Kára 2019-09-10 18:25:07 UTC
Created attachment 145324 [details]
UMR dump of registers on a GPU lockup

Sending dmesg output + UMR registers dump of both sdma0 and sdma1 for a lockup in Rise of the Tomb Raider.

[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=104586, emitted seq=104588
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RiseOfTheTombRa pid 8457 thread RiseOfTheT:cs0 pid 8463
[drm] GPU recovery disabled.

The lockup is reproducible and only affects the GPU - it's still fine to ssh to the machine and it's otherwise working fine.
Comment 36 Sebastian Meyer 2019-09-10 21:02:44 UTC
Created attachment 145326 [details]
umr output of sdma0/sdma1 after RotTR freeze

Applied the provided WIP patch to linux-mainline 5.3-rc8 and started RotTR again in order to trigger a system freeze.
This time I also got a ring sdma0 and sdma1 timeout:

[  632.175837] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[  632.175973] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[  637.299049] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=313757, emitted seq=313759
[  637.299110] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RiseOfTheTombRa pid 2584 thread RiseOfTheT:cs0 pid 2590
[  637.299111] [drm] GPU recovery disabled.
[  646.468871] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=278259, emitted seq=278263
[  646.468961] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=21116, emitted seq=21119
[  646.469052] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[  646.469141] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 989 thread plasmashel:cs0 pid 1155
[  646.469141] [drm] GPU recovery disabled.
[  646.469142] [drm] GPU recovery disabled.

Stdout of `umr -R sdma0` and `umr -R sdma1` is attached to this post. However, I also got a couple of stderr messages like "[ERROR]: No valid mapping for 3@800000023f00", which I didn't include in the output.
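
For anyone comparing timeout logs across reports, here is a minimal sketch that pulls the ring name and the two sequence numbers out of the amdgpu_job_timedout lines. The function name parse_ring_timeouts and the "pending" field are illustrative, not part of any tool mentioned in this thread, and reading emitted minus signaled as the number of outstanding fences is an assumption about what the driver reports.

```python
import re

# Matches the "ring <name> timeout" lines printed by amdgpu_job_timedout.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return one dict per ring-timeout line found in log_text."""
    results = []
    for match in TIMEOUT_RE.finditer(log_text):
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append({
            "ring": match.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Assumption: fences emitted but not yet signaled at timeout.
            "pending": emitted - signaled,
        })
    return results


# Lines copied from the dmesg output above.
SAMPLE = """\
[  637.299049] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=313757, emitted seq=313759
[  646.468871] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=278259, emitted seq=278263
[  646.468961] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=21116, emitted seq=21119
"""

for entry in parse_ring_timeouts(SAMPLE):
    print(entry["ring"], entry["pending"])
```

Feeding it the output of `dmesg` or `journalctl -k` makes it easy to see which ring timed out first in a given freeze.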
Comment 37 Jeremy Silliman 2019-09-12 12:21:35 UTC
I purchased a 5700XT the other day, and what I've noticed is that anything that tries getting statistics from the GPU (radeontop, lm_sensors) induces a page fault hang within a couple of minutes. In my testing I either ran lm_sensors every three seconds, or radeontop, and left it idle while playing a game or watching a video, and without fail, a hang would happen shortly after. As soon as I stopped running either of those programs the hangs stopped. This may work as a reproducible test case for some of the hangs.
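
The polling described above could be scripted roughly as follows. This is only a sketch of the idea, not what lm_sensors or radeontop actually do: it assumes the GPU exposes its sensors as `*_input` files under the device's hwmon directory, and the path /sys/class/drm/card0/device is an assumption that may differ per system. The function names are made up for illustration.

```python
import glob
import os
import time


def read_hwmon_inputs(device_dir):
    """Read every *_input file under device_dir/hwmon/hwmon*/ once."""
    readings = {}
    pattern = os.path.join(device_dir, "hwmon", "hwmon*", "*_input")
    for path in sorted(glob.glob(pattern)):
        name = os.path.basename(path)
        try:
            with open(path) as f:
                readings[name] = f.read().strip()
        except OSError as exc:
            # A read can fail; record the error instead of aborting.
            readings[name] = f"error: {exc}"
    return readings


def poll(device_dir, interval_s=3.0, iterations=None):
    """Print the sensor values every interval_s seconds.

    iterations=None loops until interrupted.
    """
    count = 0
    while iterations is None or count < iterations:
        print(time.strftime("%H:%M:%S"), read_hwmon_inputs(device_dir))
        count += 1
        if iterations is None or count < iterations:
            time.sleep(interval_s)


# Example call (assumed path, loops until Ctrl-C):
# poll("/sys/class/drm/card0/device")
```

Running this alongside a game or a video would approximate the three-second lm_sensors loop from the test above.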
Comment 38 Mathieu Belanger 2019-09-13 05:22:47 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #33)
> Created attachment 145323 [details] [review]
> wip patch
> 
> You can give a try to the attached kernel patch which hopefully could
> prevent some sdma timeouts.
> 
> I'm still testing it but the more testers the better :)

So far so good. Your patch seems to have fixed the "random" crash that I was able to replicate by loading my 3 many-tab browsers and PhpStorm at the same time, and I can use my IDE without crashing too.

Maybe I got really lucky too. But it's been more than a day without crash and without the nodma "fix"
Comment 39 Shmerl 2019-09-15 02:41:08 UTC
I also get such freezes when opening a new tab in Firefox (once in a while), and when using ksysguard to read amdgpu sensors with Sapphire Pulse RX 5700 XT. I'm going to try this patch.
Comment 40 Shmerl 2019-09-15 07:52:39 UTC
With that patch, I get stutters, but no hard freeze, when using ksysguard to read amdgpu sensors. I see such errors in dmesg when that happens:

[14889.400985] amdgpu: [powerplay] Failed to export SMU metrics table!
[14890.311391] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80
[14891.933714] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80
[14892.785612] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2 param 0x80
[14892.785615] amdgpu: [powerplay] Failed to export SMU metrics table!
[14894.406389] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2 param 0x80
[14894.406393] amdgpu: [powerplay] Failed to export SMU metrics table!
[14895.261140] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80
[14896.937622] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80
[14897.734712] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2 param 0x80
[14897.734714] amdgpu: [powerplay] Failed to export SMU metrics table!
Comment 41 Shmerl 2019-09-15 17:45:47 UTC
Just FYI, I just used the latest firmware from here (2019-09-13): https://people.freedesktop.org/~agd5f/radeon_ucode/navi10/

It didn't make a difference, ksysguard is still causing those powerplay errors.
Comment 42 Marko Popovic 2019-09-15 19:57:58 UTC
Ok I came home from vacation and got my hands on the WIP patch. 

Rocket-League startup SDMA-type freeze is completely gone.

I will continue testing the desktop usage without nodma enabled and will report if it fixes the random SDMA freezes as well :)

Will keep you guys updated.
Comment 43 Marko Popovic 2019-09-15 20:37:11 UTC
(In reply to Marko Popovic from comment #42)
> Ok I came home from vacation and got my hands on the WIP patch. 
> 
> Rocket-League startup SDMA-type freeze is completely gone.
> 
> I will continue testing the desktop usage without nodma enabled and will
> report if it fixes the random SDMA freezes as well :)
> 
> Will keep you guys updated.

Update: Ok NVM, I spoke too soon, RL SDMA freezes came back even with the WIP patch applied. Here is the output:

sep 15 22:34:15 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=4302, emitted seq=4304
sep 15 22:34:15 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RocketLeague pid 3123 thread RocketLeag:cs0 pid 3130
Comment 44 Marko Popovic 2019-09-15 22:22:43 UTC
Another Update:

Unfortunately even with WIP patch applied I got another random desktop freeze in Gnome shell of SDMA type.

Turned back the nodma tweak in order to avoid having those until further fixes are found.

Another note: I use the most recent firmware libraries and there don't seem to be any improvements on the sdma freezes front.
Comment 45 Mathieu Belanger 2019-09-16 05:31:34 UTC
Just an update: still no new "random" crashes since the patch was applied.

The only crashes I've had since applying the patch are some partial, recoverable ones caused by insufficient voltage to the overclocked CPU.

So that WIP patch did fix some of the crashes in this bug report, but not all.
Comment 46 Marko Popovic 2019-09-16 06:47:05 UTC
(In reply to Mathieu Belanger from comment #45)
> Just an update : Still no new "random" crash since patch applied.
> 
> The only crash I got since patch applied are some partial and recoverable
> crash that occurred due to insufficient voltage to the overclocked CPU.
> 
> So that WIP patch did fix some of the crashs in this bug report but not all.

Unfortunately I wasn't so lucky, I got both random and provoked sdma freezes soon after disabling the nodma variable :(
Comment 47 Mathieu Belanger 2019-09-16 18:16:25 UTC
Naa, random crashes still occur with FileZilla, so they're not totally gone for me. I put nodma back because I use that system for work.
Comment 48 Marko Popovic 2019-09-17 10:23:23 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #15)
> (In reply to Marko Popovic from comment #14)
> > 
> > Yes, always happens at the same place with Citra emulator
> 
> Could you capture a trace of the problem (using Apitrace or Renderdoc)?
> 
> This would be very helpful to fix it.

There is another type of freeze/hang happening when playing Starcraft II via D9VK. This one doesn't seem to be related to either ngg or dma because I have them both disabled and the hang occurs anyway.

sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2361623, emitted seq=2361625
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process SC2_x64.exe pid 20236 thread SC2_x64.exe pid 20236

Is there any way to apitrace the Vulkan API?
Comment 49 Shmerl 2019-09-17 21:24:53 UTC
Could be just a similar symptom, but I have a freeze with The Bard's Tale IV with the same error message: https://bugs.freedesktop.org/show_bug.cgi?id=111591

It's going through radeonsi path though.
Comment 50 Timur Kristóf 2019-09-18 13:45:20 UTC
(In reply to Marko Popovic from comment #0)
> Once my GPU disconnected to Monitor and remained so until I
> rebooted

I've seen this problem too and opened a separate bug report about it here:
https://bugs.freedesktop.org/show_bug.cgi?id=111733
Comment 51 Matthias Müller 2019-09-19 20:11:54 UTC
Created attachment 145436 [details]
Log of divide error
Comment 52 Matthias Müller 2019-09-19 20:12:54 UTC
Comment on attachment 145436 [details]
Log of divide error

I just encountered a "random" freeze, too.
And because it seems to be something "new", I thought I'd post it here. From what I found, it seems to be some kind of null pointer?
Comment 53 Sebastian Meyer 2019-09-20 03:54:22 UTC
Just compiled the latest mainline kernel from a few hours ago with the merge of drm-next-2019-09-18 and tried again.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=574cc4539762561d96b456dbc0544d8898bd4c6e

RotTR is still making the system freeze. I haven't tested other Vulkan applications yet.

[  330.849703] amdgpu 0000:04:00.0: [gfxhub] page fault (src_id:0 ring:24 vmid:3 pasid:32777, for process RiseOfTheTombRa pid 2371 thread RiseOfTheT:cs0 pid 2377)
[  330.849706] amdgpu 0000:04:00.0:   in page starting at address 0x00008000bf066000 from client 27
[  330.849708] amdgpu 0000:04:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00301430
[  330.849709] amdgpu 0000:04:00.0:      MORE_FAULTS: 0x0
[  330.849711] amdgpu 0000:04:00.0:      WALKER_ERROR: 0x0
[  330.849712] amdgpu 0000:04:00.0:      PERMISSION_FAULTS: 0x3
[  330.849713] amdgpu 0000:04:00.0:      MAPPING_ERROR: 0x0
[  330.849715] amdgpu 0000:04:00.0:      RW: 0x0
[  335.967209] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[  335.967290] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[  340.873553] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=73308, emitted seq=73310
[  340.873616] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RiseOfTheTombRa pid 2371 thread RiseOfTheT:cs0 pid 2377
[  340.873618] [drm] GPU recovery disabled.
[  341.086869] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
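
As a cross-check of the fault dump above, the raw GCVM_L2_PROTECTION_FAULT_STATUS value can be split into the fields the driver prints. This is a sketch only: the bit positions in FIELDS are my assumption about the register layout, not something stated in this thread, and decode_fault_status is an illustrative name. The decoded values do agree with the lines the kernel logged.

```python
# Assumed (shift, width) of each field in GCVM_L2_PROTECTION_FAULT_STATUS.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Extract each assumed bit field from the raw status value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


# Status value taken from the log above.
for name, value in decode_fault_status(0x00301430).items():
    print(f"{name}: {value:#x}")
```

With the value from this log, it prints PERMISSION_FAULTS: 0x3 and 0x0 for the other four fields, the same as the dmesg output.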

$ pacman -Q linux-git linux-firmware-agd5f-radeon-navi10 {,lib32-}{mesa-git,vulkan-radeon-git,llvm-git,libdrm-git}
linux-git 5.3.r10169.g574cc4539762-1
linux-firmware-agd5f-radeon-navi10 2019.09.13.18.36-1
mesa-git 1:19.3.0_devel.115529.8b78cce433b-1
lib32-mesa-git 1:19.3.0_devel.115529.8b78cce433b-1
vulkan-radeon-git 1:19.3.0_devel.115529.8b78cce433b-1
lib32-vulkan-radeon-git 1:19.3.0_devel.115529.8b78cce433b-1
llvm-git 10.0.0_r327281.ec841cf36ca-1
lib32-llvm-git 10.0.0_r327289.ed69faa01bf-1
libdrm-git 2.4.99.r23.g0c427545-1
lib32-libdrm-git 2.4.99.r23.g0c427545-1
Comment 54 Doug Ty 2019-09-20 06:27:42 UTC
Created attachment 145439 [details]
Additional log of divide error

(In reply to Matthias Müller from comment #52)
> Comment on attachment 145436 [details]
> Log of divide error
> 
> i just encountered a "random" freeze, too.
> And because it seems to be something "new", i thought i'd post it here -
> seems to be some kind of null pointer from what i found?

I've also been getting this, albeit very rarely. It doesn't seem to happen with older firmware (ie. Jul 14th firmware extracted from Fedora's linux-firmware package), only the newer firmware from the ~agd5f/radeon_ucode repo causes this.

I figured maybe it had to do with PCIe bandwidth, but it occurs on 3.0 as well as 4.0. Even occurs with system-wide AMD_DEBUG=nodma in my /etc/environment. There is no amdgpu error in journalctl, just the above. Screen freezes & I have to reboot with REISUB.

Perhaps this is a different issue, and if so, perhaps we should make a separate bug report for it?
Comment 55 Shmerl 2019-09-20 16:53:07 UTC
Just for reference, using AMD_DEBUG=nodma with Firefox seems to stabilize it for me. It hasn't hung for a while now.
Comment 56 leo60228 2019-09-20 22:06:33 UTC
I've noticed that AMD_DEBUG=nodma keeps OpenGL applications from crashing, but doesn't help with Vulkan. Does AMD_DEBUG only affect OpenGL, or is that unrelated?
Comment 57 Michael de Lang 2019-09-21 09:47:55 UTC
*** Bug 111759 has been marked as a duplicate of this bug. ***
Comment 58 Matthias Müller 2019-09-21 16:40:16 UTC
(In reply to Doug Ty from comment #54)
> I've also been getting this, albeit very rarely. It doesn't seem to happen
> with older firmware (ie. Jul 14th firmware extracted from Fedora's
> linux-firmware package), only the newer firmware from the
> ~agd5f/radeon_ucode repo causes this.

> 
> Perhaps this is a different issue, and if so, perhaps we should make a
> separate bug report for it?

It has only happened twice for me so far, but you are right, it started after the last firmware update. I can't find the old one, and it is hard to test :/

Don't know where to report it, as the Navi firmware is not in the kernel firmware yet?
Comment 59 Mathieu Belanger 2019-09-21 17:22:36 UTC
We have many different bugs by now, I think.

My random crash issue occurred with the early June firmware as well as the current one. I replaced the June firmware with an almost current one when I was trying to find what was causing these crashes.
Comment 60 Jeremy Attali 2019-09-22 06:28:14 UTC
Setting AMD_DEBUG=nodma did not work for me. I have an AMD Radeon RX 5700 XT. Experiencing many crashes after a few minutes playing DOOM (OpenGL).

My packages are the following:

> pacman -Q linux linux-firmware-agd5f-radeon-navi10 {,lib32-}{mesa-git,vulkan-radeon-git,libdrm}
linux 5.3.arch1-1
linux-firmware-agd5f-radeon-navi10 2019.09.13.18.36-1
mesa-git 1:19.3.0_devel.115574.40087ffc5b9-1
vulkan-radeon-git 1:19.3.0_devel.115574.40087ffc5b9-1
libdrm 2.4.99-1
lib32-mesa-git 1:19.3.0_devel.115574.40087ffc5b9-1
lib32-vulkan-radeon-git 1:19.3.0_devel.115574.40087ffc5b9-1
lib32-libdrm 2.4.99-1

I have an apitrace file but it's 3GB, not sure where I can upload it for someone to investigate (if that can help let me know I'll figure something).

Error log is the same as Marko Popovic from the first comment:
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=48914, emitted seq=48916
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process DOOMx64.exe pid 3123 thread DOOMx64.ex:cs0 pid 3172
Comment 61 Matthias Müller 2019-09-22 09:50:15 UTC
(In reply to Jeremy Attali from comment #60)
> Setting AMD_DEBUG=nodma did not work for me. I have an AMD Radeon RX 5700
> XT. Experiencing many crashes after a few minutes playing DOOM (OpenGL).

i've noticed crashes with DOOM on Navi when looking at medpacks: https://github.com/ValveSoftware/Proton/issues/3029
Comment 62 Marko Popovic 2019-09-22 12:03:58 UTC
I have created a new bug report for ring_gfx type hangs, since they don't seem to be related to ngg or dma. Please post those logs, further trace files etc. there: https://bugs.freedesktop.org/show_bug.cgi?id=111763

Let's keep this thread limited to sdma0/sdma1 type bugs that are causing random freezes on the desktop, since others seem to be more game-related.
Comment 63 Doug Ty 2019-09-30 12:32:04 UTC
(In reply to Matthias Müller from comment #58)
> Don't know where to report as the navi-firmware is not in the
> kernel-firmware, yet?

Not sure if this is correct, but I've created a separate issue for these "divide error" hangs over here:  
https://bugs.freedesktop.org/show_bug.cgi?id=111869
Comment 64 Marko Popovic 2019-10-02 16:51:58 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #33)
> Created attachment 145323 [details] [review]
> wip patch
> 
> You can give a try to the attached kernel patch which hopefully could
> prevent some sdma timeouts.
> 
> I'm still testing it but the more testers the better :)

Since it's been quite a while since the WIP patch was published, I'd like to inquire about the state of this bug. I guess that's not unreasonable, since the GPU has been out for 3 months now? Thanks in advance for the info :)
Comment 65 Shmerl 2019-10-03 06:46:02 UTC
I also don't see this patch landing in 5.4 (rc1 doesn't have it). Should we keep applying it manually for now?
Comment 66 Shmerl 2019-10-03 09:03:00 UTC
I wonder if fixes for powerplay related issues are part of this patchset: https://lists.freedesktop.org/archives/dri-devel/2019-October/238442.html
Comment 67 Marko Popovic 2019-10-03 11:17:24 UTC
(In reply to Shmerl from comment #65)
> I also don't see this patch landing in 5.4 (rc1 doesn't have it). Should we
> keep applying it manually for now?

We probably don't need the WIP version since it didn't work, I was just wondering if the team has any official news on the issue.

I will be trying 5.4 RC series in the following days and see if anything changes regarding the sdma or any other types of hangs due to those Navi related fixes.
Comment 68 Marko Popovic 2019-10-03 12:26:21 UTC
https://cgit.freedesktop.org/mesa/mesa/commit/?id=a2a68d551c1c2a4f13761ffa8f3f6f13fee7a384

This might actually fix the ring_gfx type hangs or even sdma ones at least for Vulkan API? Not exactly sure but will also be testing the latest MESA builds from Oibaf's PPA in following days and report back on the issue :)
Comment 69 Shmerl 2019-10-04 21:14:51 UTC
I just tried recent kernel 5.4-rc1+ from here: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

It supposedly already has fixes for amdgpu metrics, in this commit:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0f83eb8888690d77f76d27d803488c1047fc9a10

However I just experienced another hang when adding amdgpu fans sensors to ksysguard, so apparently it didn't fix the problem.

And there was also a hang with Firefox, when it was started without AMD_DEBUG=nodma, so that issue is not fixed either yet.

cc Alex Deucher.
Comment 70 Marko Popovic 2019-10-04 21:28:39 UTC
(In reply to Shmerl from comment #69)
> I just tried recent kernel 5.4-rc1+ from here:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> 
> It supposedly already has fixes for amdgpu metrics, in this commit:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=0f83eb8888690d77f76d27d803488c1047fc9a10
> 
> However I just experienced another hang when adding amdgpu fans sensors to
> ksysguard, so apparently it didn't fix the problem.
> 
> And there was also a hang with Firefox, when it was started without
> AMD_DEBUG=nodma, so that issue is not fixed either yet.
> 
> cc Alex Deucher.

Yes, I can confirm that with 5.4 RC1 and MESA-git from 04.10. (with radv patches included) I can reproduce all 4 types of hangs: the random desktop hang, the Rise of the Tomb Raider hang, the Starcraft II hang and even the Citra hang (even though those patches supposedly fix the ngg), so that's a huge bummer.
Comment 71 Shmerl 2019-10-04 21:35:20 UTC
(In reply to Marko Popovic from comment #70)
> 
> Yes, I can confirm that with 5.4 RC1 and MESA-git from 04.10. (with radv
> patches included) I can reproduce all 4 types of hangs, random desktop hang,
> Rise of the Tomb Raider Hang, Starcraft II hang and even Citra hang
> (eventhough those patches supposedly fix the ngg) so that's a huge bummer.

Just to clarify, those fixes were added post rc1 tag, so you'd need to build the master branch of Linus's repo (it would produce 5.4-rc1+).
Comment 72 Marko Popovic 2019-10-04 21:40:02 UTC
(In reply to Shmerl from comment #71)
> (In reply to Marko Popovic from comment #70)
> > 
> > Yes, I can confirm that with 5.4 RC1 and MESA-git from 04.10. (with radv
> > patches included) I can reproduce all 4 types of hangs, random desktop hang,
> > Rise of the Tomb Raider Hang, Starcraft II hang and even Citra hang
> > (eventhough those patches supposedly fix the ngg) so that's a huge bummer.
> 
> Just to clarify, those fixes were added post rc1 tag, so you'd need to build
> the master branch of Linus's repo (it would produce 5.4-rc1+).

Sorry I wrote poorly, I'm using 5.4 daily build.

These hangs on Navi seem to be quite a hard nut to crack for AMD. They are trying different types of patches for amdgpu, firmware, kernel and even mesa, and yet nothing ever changes :(

Maybe this issue should get a high priority at least considering that hangs basically render desktop unusable for many things, quite a few dxvk games produce hangs even with nodma and nongg applied, so no idea what could be going on there. Why do those flags work for some things and not for the others...
Comment 73 ans.belfodil 2019-10-05 17:01:13 UTC
According to this https://www.phoronix.com/scan.php?page=news_item&px=AMDGPU-Bulk-Moves-Lands and my tests (Linux 5.3.1 and packages from https://pkgbuild.com/~lcarlier/mesa-git/x86_64/), the hangs are gone on Rocket League.
Comment 74 ans.belfodil 2019-10-05 17:02:39 UTC
Woops copied the wrong link from Phoronix: https://www.phoronix.com/scan.php?page=news_item&px=RADV-Navi-Random-Hangs-19.3
Comment 75 Marko Popovic 2019-10-05 22:17:38 UTC
(In reply to ans.belfodil from comment #73)
> According to this
> https://www.phoronix.com/scan.php?page=news_item&px=AMDGPU-Bulk-Moves-Lands
> and my tests (Linux 5.3.1 and packages from
> https://pkgbuild.com/~lcarlier/mesa-git/x86_64/), the hangs are gone on
> Rocket League.

I was able to reproduce the RL hang by running Rocket League 2 times, so it's definitely not gone. I also don't see how those patches would affect the launch of Rocket League anyway: it uses OpenGL and induces an SDMA type hang, while those patches are for the RADV Vulkan driver and ngg (a different type of hang that shows up as ring_gfx hangs).
Comment 76 Sebastian Meyer 2019-10-06 19:20:19 UTC
Created attachment 145668 [details]
dump of the sdma0 ring after a timeout error

Just had another ring sdma0 timeout while being on the desktop and working. Quite infuriating.

[14191.862674] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[14191.862745] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[14196.982476] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=5420633, emitted seq=5420635
[14196.982590] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 716 thread Xorg:cs0 pid 717
[14196.982592] [drm] GPU recovery disabled.

Kernel is built from the `drm-fixes-5.4-2019-10-02` tag (basically 5.4-rc1 + amdgpu commits which will be included in rc2) with the WIP patch of this thread (drm/amdgpu: do not execute 0-sized IBs). All other libs on my system are up2date (Arch, using the mesa-git repo).
Comment 77 Shmerl 2019-10-06 19:37:13 UTC
I suspect the above is OpenGL related bug in radeonsi.
Comment 78 Shmerl 2019-10-06 20:10:01 UTC
Trying now running Firefox with most recent Mesa master, to check if any fixes in radeonsi prevent these hangs now or not.
Comment 79 Shmerl 2019-10-07 00:57:19 UTC
Just got a freeze using Firefox even with Mesa master. So it's not fixed.
Comment 80 Shmerl 2019-10-08 16:57:35 UTC
Looks like there are a bunch of fixes related to powerplay here: https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next

So if anyone has time to test, maybe some of them fix that concurrent sensor access bug, which still happens in 5.4-rc2 despite the previously added mutex fix already being there.
Comment 81 Shmerl 2019-10-08 17:44:03 UTC
Also opened Firefox specific bug here in case it's radeonsi issue: https://gitlab.freedesktop.org/mesa/mesa/issues/1910
Comment 82 Jaap Buurman 2019-10-10 07:57:10 UTC
Running the 5.3.5 Kernel, 19.2.0 Mesa and latest firmware in Arch's repository I am also running into the same issue:

[46623.025576] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=75019, emitted seq=75022
[46623.025668] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chromium pid 24183 thread chromium:cs0 pid 24244


This happened while switching tabs in Chromium. The umr command is not available for me, and AFAIK the official arch repository doesn't have any packages that provide it.
Comment 83 Pierre-Eric Pelloux-Prayer 2019-10-11 10:24:11 UTC
Another kernel patch worth trying: https://patchwork.freedesktop.org/patch/335077/
Comment 84 Shmerl 2019-10-11 20:14:27 UTC
Testing this patch now, using Firefox with nodma.
Comment 85 Shmerl 2019-10-11 20:14:43 UTC
(In reply to Shmerl from comment #84)
> Testing this patch now, using Firefox with nodma.

without* nodma.
Comment 86 Shmerl 2019-10-11 21:03:26 UTC
Looks stable so far, no hangs. I'll continue using it, and will post if it occurs again.
Comment 87 Marko Popovic 2019-10-11 23:06:51 UTC
All of the hangs are still present for me, so this patch changed nothing.
Comment 88 Shmerl 2019-10-13 01:30:09 UTC
(In reply to Marko Popovic from comment #87)
> All of the hangs are still present for me, so this patch changed nothing.

Does Firefox hang for you still?
Comment 89 Marko Popovic 2019-10-13 15:13:07 UTC
(In reply to Shmerl from comment #88)
> (In reply to Marko Popovic from comment #87)
> > All of the hangs are still present for me, so this patch changed nothing.
> 
> Does Firefox hang for you still?

Actually I was too fast to judge!

The desktop itself has not hung out of the blue like before since I started using this patch.

Rise of the Tomb Raider, Starcraft II and Rocket League hangs still happen though.

I will keep using this without nodma and let you guys know if random hangs come back.
Comment 90 Shmerl 2019-10-13 15:51:54 UTC
Those hangs are likely shader related, so not the same thing. But desktop (Firefox specifically for me) hangs look indeed fixed to me so far.
Comment 91 Marko Popovic 2019-10-13 15:54:47 UTC
(In reply to Shmerl from comment #90)
> Those hangs are likely shader related, so not the same thing. But desktop
> (Firefox specifically for me) hangs look indeed fixed to me so far.

This might actually allow us to have nodma disabled OS-wise and just keep it on for the games that hang the SDMA ring specifically :) I hope that I will have the same experience as you and that "random" hangs on the desktop are indeed fixed.
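
A per-game workaround along those lines could look like the sketch below: set AMD_DEBUG=nodma only in the environment of the process being launched instead of system-wide in /etc/environment. The helper names are made up for illustration, and treating AMD_DEBUG as a comma-separated list (so that options such as nongg are preserved) is my assumption about how Mesa parses it.

```python
import os
import subprocess


def nodma_env(base_env):
    """Return a copy of base_env with nodma added to AMD_DEBUG."""
    env = dict(base_env)
    existing = env.get("AMD_DEBUG", "")
    # Assumption: AMD_DEBUG is a comma-separated option list.
    options = [opt for opt in existing.split(",") if opt]
    if "nodma" not in options:
        options.append("nodma")
    env["AMD_DEBUG"] = ",".join(options)
    return env


def run_with_nodma(command):
    """Run command (a list of arguments) with nodma set for it only."""
    return subprocess.run(command, env=nodma_env(os.environ))


# Example call; the launcher path is hypothetical:
# run_with_nodma(["/path/to/game-launcher"])
```

Everything started outside this wrapper keeps using SDMA, so the desktop itself can run without the workaround.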
Comment 92 Marko Popovic 2019-10-13 21:44:48 UTC
Ok day recap: been using my PC for like 12 hours today, without using the nodma variable, no "random" hangs appeared on the desktop, I will make another reply in a few days if things keep like this, hopefully they do.

Considering that Rocket League launch still provokes the SDMA type hang I'd suggest to leave this bug report open for further tracking and tracing the remaining "non-random" SDMA bugs.
Comment 93 Pierre-Eric Pelloux-Prayer 2019-10-14 10:09:15 UTC
Created attachment 145734 [details] [review]
sdma read delay

Hi all,

Here's a new patch that should help with sdma issues.

This is not a replacement for https://patchwork.freedesktop.org/patch/335077/ nor https://bugs.freedesktop.org/show_bug.cgi?id=111481#c33 so ideally you should have the 3 patches applied and the "AMD_DEBUG=nodma" workaround disabled.

Let me know if it helps getting rid of the sdma timeout errors. Thanks!
Comment 94 Marko Popovic 2019-10-14 10:20:33 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #93)
> Created attachment 145734 [details] [review]
> sdma read delay
> 
> Hi all,
> 
> Here's a new patch that should help with sdma issues.
> 
> This is not a replacement for
> https://patchwork.freedesktop.org/patch/335077/ nor
> https://bugs.freedesktop.org/show_bug.cgi?id=111481#c33 so ideally you
> should have the 3 patches applied and the "AMD_DEBUG=nodma" workaround
> disabled.
> 
> Let me know if it helps getting rid of the sdma timeout errors. Thanks!

Excellent! I will test it further when I come home, I only have the recent SDMA patch applied in the kernel and "random" hangs were already gone, hopefully this will help fix the game-specific provoked SDMA type hangs! Will report back in the following days.
Comment 95 Marko Popovic 2019-10-14 16:48:51 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #93)
> Created attachment 145734 [details] [review]
> sdma read delay
> 
> Hi all,
> 
> Here's a new patch that should help with sdma issues.
> 
> This is not a replacement for
> https://patchwork.freedesktop.org/patch/335077/ nor
> https://bugs.freedesktop.org/show_bug.cgi?id=111481#c33 so ideally you
> should have the 3 patches applied and the "AMD_DEBUG=nodma" workaround
> disabled.
> 
> Let me know if it helps getting rid of the sdma timeout errors. Thanks!

Ok here's feedback already: I haven't had any issues with "random" desktop freezes since I applied the patch https://patchwork.freedesktop.org/patch/335077/ yesterday.

So basically I can now use desktop for the most part without nodma globally enabled, I will report back if it turns out not to be the case.

With all those 3 patches applied, Rocket League successfully launched 2 times out of 10, so I could say that 80% of time it will still provoke the SDMA type hang. Unfortunately this is the only game that I know that provokes the SDMA hang compared to ring_gfx hangs on Starcraft II and Rise of the Tomb Raider.
Comment 96 Sebastian Meyer 2019-10-15 19:48:35 UTC
The desktop freezes related to sdma0 timeout errors are definitely not fixed with the addition of this third patch. I just had another one while working in WebStorm and PyCharm for a couple of hours.

Kernel in use was/is `drm.fixes.5.4.2019.10.09.r0.g083164dbdb17-3` with
- drm-amdgpu-do-not-execute-0-sized-IBs.patch
- drm-amdgpu-sdma5-fix-mask-value-of-POLL_REGMEM-packet-for-pipe-sync.patch
- sdma-read-delay.patch

[103602.655947] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=11405753, emitted seq=11405755
[103602.656061] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 710 thread Xorg:cs0 pid 711
[103602.656062] [drm] GPU recovery disabled.

Unfortunately, I forgot to run `sudo umr -R sdma0` after ssh-ing into my system and reading the dmesg output, so no further debug output from me. :(
Comment 97 Shmerl 2019-10-16 17:50:29 UTC
Unfortunately I just got a random sdma Firefox hang, with that bitmask fix enabled. I didn't have the other two patches above applied though. Building 5.4-rc3 with all three applied now.
Comment 98 David Biró 2019-10-16 22:41:48 UTC
I recently became an RX 5700 XT user and have the same issues, but for me the keyboard and the mouse also die after those errors. (There is a Caps Lock indicator LED, and after the error I can't toggle it, although it still gets power.)

Can I help somehow? (Unfortunately, I have no idea how I can patch a kernel module on Arch.)
Comment 99 Shmerl 2019-10-16 23:01:15 UTC
(In reply to David Biró from comment #98)
> 
> Can I help you somehow? (Unfortunately, I've got no idea how can I patch a
> kernel module on Arch.)

On Debian testing, I simply build a whole kernel, applying needed patches first.
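
A rough sketch of that workflow, under stated assumptions: `apply_patches` is a hypothetical helper, the `~/src/linux` path is made up, and the patch file names are the ones listed in comment 96. Distribution packaging (for example an Arch PKGBUILD) may wrap these steps differently.

```shell
# Apply each patch to the kernel tree in $1, stopping at the first failure.
# A dry run comes first so a rejected patch never leaves the tree half-patched.
apply_patches() {
    tree=$1; shift
    for p in "$@"; do
        patch -d "$tree" -p1 --dry-run < "$p" > /dev/null || {
            echo "does not apply cleanly: $p" >&2
            return 1
        }
        patch -d "$tree" -p1 < "$p" || return 1
    done
}

# Usage (paths are assumptions):
#   apply_patches ~/src/linux \
#       drm-amdgpu-do-not-execute-0-sized-IBs.patch \
#       drm-amdgpu-sdma5-fix-mask-value-of-POLL_REGMEM-packet-for-pipe-sync.patch \
#       sdma-read-delay.patch
#   cd ~/src/linux && make olddefconfig && make -j"$(nproc)"
#   sudo make modules_install install
```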
Comment 100 Andrew Sheldon 2019-10-17 06:59:55 UTC
I'll add that the SDMA fixes don't help for me either. mpv + gpu-hq profile (OGL) reproduces the issue the most reliably, albeit still randomly.
Comment 101 Marko Popovic 2019-10-17 19:31:26 UTC
Created attachment 145766 [details]
APITrace from Rocket League successful launch

OK, since the hang could not be reproduced with that one-frame trace, I'm attaching a trace file captured when Rocket League launches successfully. Maybe you can get some information out of the trace about what might cause the SDMA hangs 80% of the time when launching the game.
Comment 102 Marko Popovic 2019-10-17 19:33:02 UTC
(In reply to Marko Popovic from comment #101)
> Created attachment 145766 [details]
> APITrace from Rocket League successful launch
> 
> Ok so since it was unable to be reproduced with that 1 frame long trace,
> here I'm attaching a trace file that happens when Rocket League launches
> successfully... Maybe you guys can get some information out of the trace on
> what might cause the SDMA hangs 80% of the time when launching the game.

PS: The hang usually occurs immediately when the game tries to launch.
Comment 103 Shmerl 2019-10-17 19:38:10 UTC
I just got a random Firefox freeze with all three above patches applied. So it's clearly not fixed yet (though such hangs are a lot less common than before now):

[78836.138723] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[78841.770422] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=133096, emitted seq=133098
[78841.770490] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GPU Process pid 1882 thread firefox-bi:cs0 pid 2034
[78841.770493] [drm] GPU recovery disabled.
Comment 104 kinovavi 2019-10-18 11:05:49 UTC
Are these bugs going to be fixed anytime soon? With much respect, this is insane. The card has been out for some 4 months now and it still isn’t usable for most on Linux. Yes, for some the patches at least let them browse firefox effectively, but for some it doesn’t. We’re Amd customers like anyone else, support was supposed to be introduced on Mesa 19.2 and improved on Mesa 19.3, so far none of these two versions work properly. Please get your shit together Amd, this is ridiculous.
Comment 105 Marko Popovic 2019-10-18 11:13:40 UTC
(In reply to kinovavi from comment #104)
> Are these bugs going to be fixed anytime soon? With much respect, this is
> insane. The card has been out for some 4 months now and it still isn’t
> usable for most on Linux. Yes, for some the patches at least let them browse
> firefox effectively, but for some it doesn’t. We’re Amd customers like
> anyone else, support was supposed to be introduced on Mesa 19.2 and improved
> on Mesa 19.3, so far none of these two versions work properly. Please get
> your shit together Amd, this is ridiculous.

With all respect to you sir, calling names and aggressiveness isn't going to fix any issues.
The AMD team is aware of the issues, but debugging hangs that are hard to reproduce, or even random in some cases, is very difficult. Now I'm not entirely defending AMD here; it is the manufacturer's job to support the hardware if they claim on the box that they do. However, calling names isn't going to get anything fixed.
Comment 106 Daniel Suarez 2019-10-18 12:01:38 UTC
I do agree we should remain civilized in this, but user is correct, this is ridiculous. 

This would be somewhat understandable if this was a GPU from the business line and it was a GPU barely anyone owns, but this is a popular GPU and it is constantly recommended to others.
Comment 107 Marko Popovic 2019-10-18 12:05:18 UTC
(In reply to Daniel Suarez from comment #106)
> I do agree we should remain civilized in this, but user is correct, this is
> ridiculous. 
> 
> This would be somewhat understandable if this was a GPU from the business
> line and it was a GPU barely anyone owns, but this is a popular GPU and it
> is constantly recommended to others.

Yes, imagine if a business with 100 Linux workstations went for Navi cards and faced these issues... it would be a tremendous loss for them. I really don't think that this was fair to AMD Linux customers at all, but calling them names won't get us anywhere.

And just as we speak about that, I had a random SDMA hang in Firefox. They are much less frequent in Firefox now, but still very much present.
Comment 108 Daniel Suarez 2019-10-18 13:21:51 UTC
(In reply to Marko Popovic from comment #107)
> (In reply to Daniel Suarez from comment #106)
> > I do agree we should remain civilized in this, but user is correct, this is
> > ridiculous. 
> > 
> > This would be somewhat understandable if this was a GPU from the business
> > line and it was a GPU barely anyone owns, but this is a popular GPU and it
> > is constantly recommended to others.
> 
> Yes, Imagine if a bussiness with 100 linux workstations went for Navi cards
> and faced these issues... it would be a tremendous loss for them, I really
> don't think that this was fair to AMD Linux customers at all, but calling
> them names won't get us anywhere.
> 
> and just as we speak about that I had a random SDMA hang in Firefox, they
> are way less frequent in FF now, but very much still present.


I don't think anyone here is calling anyone anything, just that amd needs to get their shit together, which is completely true.
Comment 109 Shmerl 2019-10-18 20:48:36 UTC
I'm not sure if it really makes any difference, but I think the Firefox hang happens more commonly after I resume the computer from suspend than after a fresh boot. I'll pay attention now, testing with a fresh boot, to see whether it happens or not.
Comment 110 Sebastian Meyer 2019-10-19 05:49:50 UTC
Created attachment 145773 [details]
umr output after sdma0 timeout

Another random sdma0 timeout while using kernel drm.fixes.5.4.2019.10.16.r0.gd12c50857c6e-1 with all mentioned patches applied (one of them already included on the drm-fixes branch). This time I didn't forget about the umr debug output, but I'm not sure if it's even relevant anymore considering the number of already submitted reports.

The system freeze happened after working with WebStorm and SmartGit for roughly 10 minutes on KDE Plasma while scrolling in one of the application's windows.

[39816.999159] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[39816.999298] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
[39821.905604] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=3360854, emitted seq=3360856
[39821.905718] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 717 thread Xorg:cs0 pid 718
[39821.905720] [drm] GPU recovery disabled.

I would really appreciate it if AMD and the AMDGPU devs could focus on fixing these major stability issues of their now almost 4 months old mainstream consumer GPUs. I'm sorry if this sounds harsh, but the hardware has been advertised with Linux support and it's clearly unusable. This needs to be fixed as soon as possible. Thank you!
Comment 111 Jeremy Attali 2019-10-19 20:57:44 UTC
I confirm I'm also still getting some hangs from time to time. Mostly I think after a resume from Suspend.


pacman -Q linux linux-firmware {,lib32-}{mesa-git,vulkan-radeon-git,libdrm}

linux 5.3.6.arch1-1
linux-firmware 20190923.417a9c6-1
mesa-git 1:19.3.0_devel.116317.268e0e01f37-1
vulkan-radeon-git 1:19.3.0_devel.116317.268e0e01f37-1
libdrm 2.4.99-1
lib32-mesa-git 1:19.3.0_devel.116317.268e0e01f37-1
lib32-vulkan-radeon-git 1:19.3.0_devel.116317.268e0e01f37-1
lib32-libdrm 2.4.99-1
Comment 112 Shmerl 2019-10-20 01:38:16 UTC
(In reply to Jeremy Attali from comment #111)
> I confirm I'm also still getting some hangs from time to time. Mostly I
> think after a resume from Suspend.

I wonder if on resume something is getting messed up, and it's a motherboard firmware dependent issue?
Comment 113 Daniel Suarez 2019-10-20 13:59:57 UTC
(In reply to Shmerl from comment #112)
> (In reply to Jeremy Attali from comment #111)
> > I confirm I'm also still getting some hangs from time to time. Mostly I
> > think after a resume from Suspend.
> 
> I wonder if on resume something is getting messed up, and it's a motherboard
> firmware dependent issue?

I don't ever suspend so I can't comment, but the issue for me happens constantly regardless. Completely unusable honestly, had to put back in my GTX1060 in the meantime, hopefully AMD fixes this issue soon
Comment 114 Shmerl 2019-10-20 20:13:11 UTC
Just to clarify. Is this affecting only OpenGL code paths? Firefox with WebRender is for example using OpenGL. I.e. can those hangs be a problem with radeonsi doing something incorrectly, or it's for sure a bug in amdgpu kernel driver?
Comment 115 Mark Dietzer 2019-10-20 21:07:26 UTC
For me it seems to happen commonly when I watch 60fps video (YouTube) using Firefox on my RX 5700 XT (currently on Fedora 31 with latest distro packages).
Even 4K video at 30fps does not seem to cause any issues.

I have not yet managed to reproduce the hang in gaming or benchmark use (no matter if OpenGL or Vulkan)

The first time this happened today it was accompanied by the following kernel messages and led to a full lockup of graphics until reboot:
amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param 0x80
amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param 0x80
amdgpu: [powerplay] Failed to export SMU metrics table!
amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param 0x80
amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param 0x80
amdgpu: [powerplay] Failed to export SMU metrics table!
amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param 0x80
amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param 0x80
amdgpu: [powerplay] Failed to export SMU metrics table!
amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param 0x80
amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param 0x80
amdgpu: [powerplay] Failed to export SMU metrics table!
amdgpu: [powerplay] Failed to send message 0x36, response 0xfffffffb, param 0x0
amdgpu: [powerplay] Failed to send message 0x36, response 0xfffffffb param 0x0
amdgpu: [powerplay] [smu_v11_0_get_power_limit] get PPT limit failed!
amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param 0x80
amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param 0x80
amdgpu: [powerplay] Failed to export SMU metrics table!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=202333, emitted seq=202336
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0

The second time, it was only a short (few seconds) hang and yielded the following kernel output, currently still up and running after that message:
[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Comment 116 Shmerl 2019-10-20 21:17:02 UTC
Metrics bug is something with powerplay, that's a different one from sdma timeouts. It happens when you query amdgpu sensors concurrently. To avoid it, simply don't query them (like using ksysguard or any other GPU temperature, fans and all that kind of monitoring).

Though it's for sure annoying that that sensors bug is still not fixed yet.
Comment 117 Daniel Suarez 2019-10-21 01:24:42 UTC
(In reply to Mark Dietzer from comment #115)
> For me it seems to happen commonly when I watch 60fps video (YouTube) using
> Firefox on my RX 5700 XT (currently on Fedora 31 with latest distro
> packages).
> Even 4K video at 30fps does not seem to cause any issues.
> 
> I have not yet managed to reproduce the hang in gaming or benchmark use (no
> matter if OpenGL or Vulkan)
> 
> The first time this happened today it was accompanied by the following
> kernel messages and led to a full lockup of graphics until reboot:
> amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param
> 0x80
> amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param
> 0x80
> amdgpu: [powerplay] Failed to export SMU metrics table!
> amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param
> 0x80
> amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param
> 0x80
> amdgpu: [powerplay] Failed to export SMU metrics table!
> amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param
> 0x80
> amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param
> 0x80
> amdgpu: [powerplay] Failed to export SMU metrics table!
> amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param
> 0x80
> amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param
> 0x80
> amdgpu: [powerplay] Failed to export SMU metrics table!
> amdgpu: [powerplay] Failed to send message 0x36, response 0xfffffffb, param
> 0x0
> amdgpu: [powerplay] Failed to send message 0x36, response 0xfffffffb param
> 0x0
> amdgpu: [powerplay] [smu_v11_0_get_power_limit] get PPT limit failed!
> amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb, param
> 0x80
> amdgpu: [powerplay] Failed to send message 0xe, response 0xfffffffb param
> 0x80
> amdgpu: [powerplay] Failed to export SMU metrics table!
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled
> seq=202333, emitted seq=202336
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid
> 0 thread  pid 0
> 
> The second time, it was only a short (few seconds) hang and yielded the
> following kernel output, currently still up and running after that message:
> [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed
> out or interrupted!

Test out kernel 5.4rc4, it should have addressed this I believe.
Comment 118 Andrew Sheldon 2019-10-21 02:21:44 UTC
(In reply to Daniel Suarez from comment #117)
> Test out kernel 5.4rc4, it should have addressed this I believe.
If you're referring to: drm/amdgpu/sdma5: fix mask value of POLL_REGMEM packet for pipe sync

it doesn't actually fix the problem (and the majority in this thread have already applied this patch, along with the other two workarounds). It might fix some of the sdma hangs, but not all of them.
Comment 119 Daniel Suarez 2019-10-21 10:36:49 UTC
(In reply to Andrew Sheldon from comment #118)
> (In reply to Daniel Suarez from comment #117)
> > Test out kernel 5.4rc4, it should have addressed this I believe.
> If you're referring to: drm/amdgpu/sdma5: fix mask value of POLL_REGMEM
> packet for pipe sync
> 
> it doesn't actually fix the problem (and the majority in this thread have
> already applied this patch, along with the other two workarounds). It might
> fix some of the sdma hangs, but not all of them.

You're right those patches don't really help at all, I was referring to the powerplay one
Comment 120 Daniel Suarez 2019-10-21 12:02:53 UTC
Am I correct in assuming that there are no other patches or commits waiting to be upstreamed? Great, Mesa 19.2.2 will release this Wednesday and again be another release that's unusable. Same goes for Mesa-git 19.3 I suppose, unacceptable from Amd.
Comment 121 bugs 2019-10-22 15:50:23 UTC
I have the same problem on Arch Linux. I tried Mesa + LLVM stable (19.2/9.0) and the git versions, with amdgpu and even with plain modesetting. I get random freezes with Xfce (with and without compositor) and nearly immediate freezes with Rise of the Tomb Raider. "Freezing" means X11 freezes; Magic SysRq and SSH still work.
I had to remove the card because the computer was completely unusable, with 4 freezes in 15 minutes. So I can't provide you with more information, sorry.
But if I can give you any information without putting the card back into the computer (the slot has suffered a bit...), I am here.

Now I found this bug report and wonder why it is 8 weeks old, still "new" and unassigned, and has no severity set. In my opinion, a freezing computer is really critical!

And I wonder why the bug is only reported for Arch/Manjaro and Ubuntu. Are all other distros too old to work with Navi at all? I didn't even find a report from Gentoo.
Comment 122 Marko Popovic 2019-10-22 15:57:04 UTC
(In reply to bugs from comment #121)
> I have the same problem using archlinux. I tried mesa+llvm stable
> (19.2/9.0), the git-versions with amdgpu and even with plain modesetting. I
> have random freezes with xfce (with and without compositor) and nearly
> immediatly freezes with Rise of the Tomb Raider. "Freezing" means X11, Magic
> SysRQ and SSH still works.
> I had to remove the card because the computer was competely unusable with 4
> freezes in 15 minutes. So I can't provide you with more information, sorry.
> But if I can give you any information without putting the card back into the
> computer (the slot has suffered a bit...) I am here.
> 
> Now I found this bug report and wonder, why it is 8 weeks old, still "new"
> and unassigned and severity is not set. In my opinion a freezing computer is
> really critical! 


I kinda wonder that myself. I set it to critical and an AMD dev removed the critical tag, so they apparently disagree that not being able to use your hardware is a critical bug (thinking).

+ The bug is present on all systems running LLVM 9 and Mesa 19.2+... Ubuntu too.
Comment 123 Sabbie 2019-10-22 16:19:34 UTC
I'm having the same problem on an RX 5700, running Arch.

- 3.5.7 Kernel
- mesa-git 1:19.3.0_devel.116477.3ad6154f4eb-1 
- llvm-git 10.0.0_r329841.1c982af0599-1

GPU crashes on various activities and seemingly at random. Happened both while browsing and playing games. Usually it crashes with `ring gfx_0.0.0 timeout`. Sometimes it works for hours, sometimes it crashes every 5 minutes.

I can provide logs if needed.
Comment 124 yamagi 2019-10-22 18:00:06 UTC
Interestingly, I've got the problem the other way round. My 5700 XT had been running fine since I got it about two weeks ago. This is Arch Linux; I've run Mesa 19.2.1 and llvm-libs 9.0.0 since day one. The card was stable with 5.4-RC2 and 5.4-RC3: not a single hang in about 10 hours of The Witcher 3 under wine + dxvk and Yamagi Quake II with the OpenGL 3.2 renderer. After I upgraded to 5.4-RC4 I've seen several GPU hangs. The last one, and the only one that's still in the logs, was:

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=85270, emitted seq=85272

That one was in Yamagi Quake II, but I had hangs on the desktop and in The Witcher 3. I have no umr reports so far. I've just compiled the tool and will see if I can get some.
Comment 125 Shmerl 2019-10-23 02:32:48 UTC
Just built 5.4-rc4.

I still get these in dmesg when using ksysguard with amdgpu sensors:

[  323.750015] amdgpu: [powerplay] failed send message: TransferTableSmu2Dram (18)      param: 0x00000006 response 0xfffffffb
[  323.750018] amdgpu: [powerplay] Failed to export SMU metrics table!

However so far it didn't cause a hang like it used to do before, which is an improvement for the powerplay bug. But the message shows that something wrong is still going on.
Comment 126 Benjamin Neff 2019-10-23 12:54:22 UTC
I have the same problem on gentoo with 5.4 kernel and mesa-git, so it's not only arch and ubuntu, it's all distros. But I didn't create an account before, because I thought there were already enough comments which confirm the bug. I hope this can be fixed soon because it's really annoying, tell me if I can help with anything.

I think the powerplay/metrics error is a different bug, I saw that too on my system, but not that often and unrelated to the freezes (so way less annoying).
Comment 127 Shmerl 2019-10-23 13:50:17 UTC
powerplay is a different bug from the sdma one, but it was listed as part of this report before, that's why I mentioned it above.
Comment 128 yamagi 2019-10-23 17:18:44 UTC
(In reply to yamagi from comment #124)
> Interestingly I've got the problem the other way round. My 5700XT was
> running fine since I got it about two weeks ago. This is Arch Linux, I've
> run Mesa 19.2.1 and llvm-libs 9.0.0 since day one. The card was stable with
> 5.4-RC2 and 5.4-RC3, not a single hang in about 10 hours The Witcher 3 under
> wine + dxvk and Yamagi Quake II with OpenGL 3.2 renderer. After I upgraded
> to 5.4-RC4 I've seen several GPU hangs. The last one, and the only one
> that's still in the logs was:
> 
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled
> seq=85270, emitted seq=85272
> 
> That one was in Yamagi Quake II, but I had hangs on the desktop and in The
> Witcher 3. I have no umr reports so far. I've just compiled the tool and
> will see if I can get some.


As promised, some more information:

For me the crash is fairly easy to reproduce with Linux 5.4-RC4. All it takes is Yamagi Quake II (revision 1232289, available at https://github.com/yquake2/yquake2) with the OpenGL 3.2 renderer. The old OpenGL 1.4 renderer doesn't trigger it. Start the game (it's a good idea to set timedemo mode to 1) and just let it cycle through the demo loop until it crashes. I used './quake +set timedemo 1 +set vid_renderer gl3'. I never experienced this crash in the wild with Linux 5.4-RC3 until I learned that I can trigger it with the Quake II demo loop. With Linux 5.4-RC3 it usually takes somewhere between 20 and 30 cycles through the loop to trigger; with 5.4-RC4 only 5 to 10 cycles. So something changed between RC3 and RC4 that made it more likely.

I suspect some kind of timing issue. The demo loop is deterministic: it generates exactly the same API calls each time it's run. While the crash always happens while the loading screen is up, it never occurs at the same one. Sometimes it's in the fifth iteration, the next time in the 12th, and so on. Running it under apitrace (which adds some latency!) makes it much less likely to occur, to the point that I thought it was a heisenbug. The same goes for cycling through the loop without timedemo mode enabled (~20 FPS in normal mode, ~1000 FPS in timedemo mode).

I made an apitrace for easier reproduction. It's a little bit big for bugzilla, so I've uploaded it here: https://deponie.yamagi.org/temp/quake2.trace.xz Replaying it usually triggers the crash during the first or second run.
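
Since the hang shows up at a different iteration each time, a small replay loop helps record how many runs it took. This is only a sketch: `loop_with_counter` and the log file name are hypothetical, and it assumes the trace has been decompressed to `quake2.trace` and that `glretrace` (the replayer named in the dmesg output below) is on the PATH.

```shell
# Run a command repeatedly, writing the iteration number to a file (and
# syncing it to disk) before each run, so the count survives a hard reset.
loop_with_counter() {
    log=$1; max=$2; shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        echo "$i" > "$log"
        sync
        "$@" || return 1
        i=$((i + 1))
    done
}

# Usage:
#   xz -d quake2.trace.xz
#   loop_with_counter replay-iterations.log 50 glretrace quake2.trace
```

After a freeze and reboot, the number left in the log file is the iteration during which the GPU hung.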

The exact software versions were:
* Linux 5.4-RC4 with https://bugzilla.freedesktop.org/attachment.cgi?id=145323 and https://bugzilla.freedesktop.org/attachment.cgi?id=145734 applied.
* Mesa 19.2.1-2
* LLVM 9.0.0

dmesg output after a crash in Quake IIs demo loop is:
[  122.294181] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=177737, emitted seq=177739
[  122.294256] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process glretrace pid 1302 thread glretrace:cs0 pid 1303
[  122.294257] [drm] GPU recovery disabled.

dmesg output after a crash by replaying the apitrace is:
[  266.695388] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=27598, emitted seq=27600
[  266.695463] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process glretrace pid 1372 thread glretrace:cs0 pid 1373
[  266.695465] [drm] GPU recovery disabled.

I'm attaching the state of sdma0 in both cases.

I hope this helps to find the root cause of this. If I can provide more information, don't hesitate to ask.
Comment 129 yamagi 2019-10-23 17:19:34 UTC
Created attachment 145799 [details]
sdma0 after apitrace crash
Comment 130 yamagi 2019-10-23 17:20:09 UTC
Created attachment 145800 [details]
sdma0 after q2 crash
Comment 131 Shmerl 2019-10-23 18:18:11 UTC
Does it also hang with Mesa master?
Comment 132 Jaap Buurman 2019-10-23 18:21:00 UTC
Many people are experiencing the hangs with OpenGL: Quake with the OpenGL renderer, Chrome/Chromium, Firefox, etc.

I am beginning to suspect that RadeonSI might be to blame instead of the AMDGPU kernel driver. Does anyone agree with that notion?
Comment 133 Shmerl 2019-10-23 18:31:06 UTC
(In reply to Jaap Buurman from comment #132)
>
> I am beginning to suspect that RadeonSI might be to blame instead of the
> AMDGPU kernel driver. Does anyone agree with that notion?

That's my suspicion as well, since I haven't gotten any hangs so far with radv/llvm or radv/aco when playing games, especially after aco added several GPU hazard mitigations for Navi:

https://gitlab.freedesktop.org/mesa/mesa/blob/master/src/amd/compiler/README

I wonder if the above hangs are related to similar GPU hardware bugs in Navi, that weren't yet worked around in radeonsi.

I even opened one bug like that, but closed it due to assuming it's really amdgpu problem. Not sure if I should re-open it:

https://gitlab.freedesktop.org/mesa/mesa/issues/1910
Comment 134 Benjamin Neff 2019-10-23 18:35:07 UTC
I had the freeze with different versions of mesa-git, but I haven't updated it in two days. I can try with the current master and see if it still freezes. I don't have a way to reproduce it; it is usually random and I have never had two similar freezes.

I don't know if OpenGL is involved in all of my freezes, I had freezes while watching a video on youtube in chromium, also while just browsing the web in firefox, once it crashed during rendering file preview images of photos in nautilus, and once it crashed while the screen was locked with i3lock. It didn't freeze during games yet.

I have always a browser open, so it could just be a browser doing something in the background. The process mentioned in the error message is usually Xorg itself, but once it was chrome.
Comment 135 Shmerl 2019-10-23 18:36:40 UTC
(In reply to Benjamin Neff from comment #134)
> 
> I don't know if OpenGL is involved in all of my freezes, I had freezes while
> watching a video on youtube in chromium, also while just browsing the web in
> firefox, once it crashed during rendering file preview images of photos in
> nautilus, and once it crashed while the screen was locked with i3lock. It
> didn't freeze during games yet.
> 

Both Firefox and Chromium are only using OpenGL paths, they aren't using Vulkan. So it's safe to assume it's going through radeonsi.
Comment 136 Jaap Buurman 2019-10-23 18:56:44 UTC
Has anyone tried AMD's closed source OpenGL driver to see if that one is stable?
Comment 137 Alexandr Kára 2019-10-23 19:04:34 UTC
It hangs reproducibly with RADV (Vulkan) as well. As an example, many people report crashes with Rise of the Tomb Raider.
Comment 138 Shmerl 2019-10-23 19:30:40 UTC
(In reply to Alexandr Kára from comment #137)
> It hangs reproducibly with RADV (Vulkan) as well. As an example, many people
> report crashes with Return of the Tomb Raider.

It could be a bug in radv as well related to Navi hazards.
Comment 139 Daniel Suarez 2019-10-23 20:04:16 UTC
I get instant hangs when playing Space Engineers. The moment I load into a world it completely hangs my system; I cannot even switch to a TTY.

Tested with Manjaro and Mesa-git along with all the other packages recommended in https://wiki.archlinux.org/index.php/Navi_10
Comment 140 Shmerl 2019-10-23 20:12:49 UTC
For individual games, I recommend opening separate bugs for each title.
Comment 141 Pierre-Eric Pelloux-Prayer 2019-10-23 20:16:30 UTC
Thanks for the quake 2 trace, I could reproduce the same hang here.

If anyone has a reliable way to trigger the issue, the most helpful thing to do for now is an apitrace capture.

The umr logs were helpful (thanks!) but I don't need more of them at the moment.

I don't think radv uses SDMA at all, so they cannot be affected by this issue. 
For radeonsi the AMD_DEBUG=nodma environment variable is a workaround until we figure out a proper fix.
Comment 142 Jaap Buurman 2019-10-23 20:25:56 UTC
How can I set both AMD_DEBUG=nongg and AMD_DEBUG=nodma in the /etc/environment file? Do they need to be on two separate lines, or will the second line simply overwrite the first one by setting the same environment variable? Do they need to be comma separated maybe?
Comment 143 Shmerl 2019-10-23 20:30:43 UTC
(In reply to Jaap Buurman from comment #142)
> How can I set both AMD_DEBUG=nongg and AMD_DEBUG=nodma in the
> /etc/environment file? Do they need to be on two separate lines, or will the
> second line simply overwrite the first one by setting the same environment
> variable? Do they need to be comma separated maybe?

It's probably better to avoid a wide setting like that. If you know some applications that hang (like Firefox or a specific game), just set it when launching them (you can, for example, add it to a .desktop file or a start script).
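
A minimal sketch of such a start script, assuming nothing beyond the `AMD_DEBUG=nodma` workaround from comment 141: `run_nodma` is a hypothetical wrapper name, and the `.desktop` line in the comment is an illustrative launcher entry, not taken from any shipped file.

```shell
# Run a single program with "nodma" added to AMD_DEBUG, keeping any flags
# that are already set and leaving the rest of the session untouched.
run_nodma() {
    flags="${AMD_DEBUG:+$AMD_DEBUG }nodma"
    AMD_DEBUG="$flags" "$@"
}

# Usage from a start script:
#   run_nodma firefox
# Equivalent idea in a .desktop launcher:
#   Exec=env AMD_DEBUG=nodma firefox %u
```

Flags are joined with a space here because that is the form reported as working elsewhere in this thread.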
Comment 144 Jaap Buurman 2019-10-23 20:36:32 UTC
I need as close to 100% uptime on this machine as possible, so I don't really have the time to add applications one by one until the problem is fixed. I need stability now. So a systemwide setting is fine for me, even if it might result in big performance losses. I'll wait until a proper fix is found.

Do you happen to know whether it will require two lines to set both debug options, or does the environment variable expect the values to be comma-separated?
Comment 145 Shmerl 2019-10-23 20:44:12 UTC
According to

man environment

The /etc/environment file specifies the environment variables to be set. The file must consist of simple NAME=VALUE pairs on separate lines.
Comment 146 Shmerl 2019-10-23 20:49:45 UTC
You can also use your $HOME/.profile for setting session-wide variables.
Comment 147 Daniel Suarez 2019-10-23 21:04:17 UTC
(In reply to Jaap Buurman from comment #144)
> I need as close as 100% uptime on this machine as possible, so I don't
> really have the time to add applications over time until the problem is
> fixed. I need stability now. So a systemwide setting is fine for me, even if
> it might result in big performance losses. I'll wait until a proper fix is
> found.
> 
> Do you happen to know whether it will require two lines to set both debug
> options, or does the environment variable expect the values to be
> comma-separated?

You shouldn't be using a 5700 XT in a system that demands 100% uptime, I have had mine randomly hang in the night without Firefox even being open, only qbittorrent and discord
Comment 148 Shmerl 2019-10-23 21:06:02 UTC
(In reply to Daniel Suarez from comment #147)
> 
> You shouldn't be using a 5700 XT in a system that demands 100% uptime, I
> have had mine randomly hang in the night without Firefox even being open,
> only qbittorrent and discord

For the reference, common UI toolkits (GTK and Qt) use OpenGL rendering too.
Comment 149 Seba Pe 2019-10-24 01:05:38 UTC
(In reply to Jaap Buurman from comment #136)
> Has anyone tried AMD's closed source OpenGL driver to see if that one is
> stable?

I've been running AMDGPU-PRO without issues while waiting for a fix for this (5700XT). OpenGL apps appear to work fine. With Vulkan I've had a crash to desktop but I haven't tested it that much. No freezes at least.
Comment 150 Stijn Tintel 2019-10-24 08:14:23 UTC
(In reply to Jaap Buurman from comment #142)
> How can I set both AMD_DEBUG=nongg and AMD_DEBUG=nodma in the
> /etc/environment file? Do they need to be on two separate lines, or will the
> second line simply overwrite the first one by setting the same environment
> variable? Do they need to be comma separated maybe?

AMD_DEBUG="nodma nongg"

I've been running like this since I found this bug report. Current uptime:
11:08:41 up 4 days,  4:12, 11 users,  load average: 8,56, 8,33, 8,15

Haven't experienced a single hang, not even a kernel oops. Before that, the system was frustratingly unstable. If you need stability, put this in /etc/environment (or /etc/env.d/99amdgpu or so if your distro supports /etc/env.d).

Running on Gentoo, kernel 5.3.4, mesa 19.2.1, llvm 9.0.0, libdrm 2.4.99, xf86-video-amdgpu git e6fce59a071220967fcd4e2c9e4a262c72870761.
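
To confirm after logging in that both flags actually reached the session, a small check can help. This is a sketch only: `amd_debug_has` is a hypothetical helper, and accepting a comma as well as a space as separator is a convenience of the helper, not a statement about what Mesa accepts.

```shell
# Succeed if flag $1 appears in AMD_DEBUG (space- or comma-separated).
amd_debug_has() {
    for f in $(printf '%s' "$AMD_DEBUG" | tr ',' ' '); do
        [ "$f" = "$1" ] && return 0
    done
    return 1
}

# Usage in a fresh login shell:
#   amd_debug_has nodma && amd_debug_has nongg && echo "both flags active"
```

It also illustrates the question from comment 142: two separate `AMD_DEBUG=...` assignments do not accumulate, because the later value replaces the earlier one, so both flags have to go into a single value.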
Comment 151 L.S.S. 2019-10-24 13:25:58 UTC
Created attachment 145807 [details]
captured GCVM_L2_PROTECTION_FAULT errors in the log. This was captured on 5.4(rc) kernel.

I'm having similar issues with Navi on Manjaro (both 5.3 and 5.4 kernels). Both kernels were from official Manjaro repos.

It's almost 100% reproducible using Cinnamon's file manager, Nemo. It can happen right after I start it, or after I click something (such as opening a folder). Interestingly, I haven't gotten a freeze from using web browsers (Firefox, Chromium) just yet.

When the system froze, everything else kept running. The freeze happened in the morning, and since I was about to leave for work I left the system as is until I got back home in the evening. The xmrig (CPU) mining session in the background continued to work as normal, as observed from the pool's dashboard.

It seems the protection fault errors only appear after the system has been frozen long enough (I only saw them the one time I left it frozen for a while; every other time I reset my system right after it froze). If the system is reset shortly after the freeze, the log ends at "ring sdma0 timeout".

It seems the "nodma nongg" trick partially worked on 5.3 (5.3.6 to be precise), as the system hasn't frozen for the time being (even when using Nemo). However, it doesn't work with the 5.4 (rc) kernel, as I still got a freeze caused by the same "ring sdma0 timeout" error.

Off-topic: On 5.3 kernel, the mouse cursor feels sluggish as if my monitor is running at 30Hz (while xrandr reports it's indeed 60Hz), while the mouse cursor works fine on 5.4(rc) kernel.
Comment 152 L.S.S. 2019-10-24 14:21:09 UTC
UPDATE: I just got another freeze on 5.3.6 kernel. The same GCVM_L2_PROTECTION_FAULT error followed by a ring sdma0 timeout.

So it seems AMD_DEBUG="nodma nongg" doesn't really work for me.
Comment 153 Marko Popovic 2019-10-24 16:18:15 UTC
(In reply to L.S.S. from comment #152)
> UPDATE: I just got another freeze on 5.3.6 kernel. The same
> GCVM_L2_PROTECTION_FAULT error followed by a ring sdma0 timeout.
> 
> So it seems AMD_DEBUG="nodma nongg" doesn't really work for me.

Can you at least provide the dmesg log so we can determine what type of hang you're having and direct you to the right bug tracker? There are multiple types, and it also varies greatly from one desktop environment to another, Wayland or not, etc. This topic mostly concerns the SDMA-type hangs that happen at random, and AMD_DEBUG=nodma seems to take care of those for almost everyone. I don't think using nongg is necessary, since so far it has only been proven to take care of one specific hang in the Citra emulator, which is also a ring gfx type hang, so it's a driver bug, probably not kernel-driver related.
Comment 154 L.S.S. 2019-10-24 17:09:37 UTC
I'm not sure how to pipe the dmesg log to a local file so that the moment the system freezes can be captured.

And interestingly, the GCVM_L2_PROTECTION_FAULT errors that I saw from journalctl when it froze last time went missing somehow... maybe I mistook it, but whenever the system froze the following lines are guaranteed to show up (ring sdma0 timeout), so it's most likely sdma0 type.

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=151787, emitted seq=151789
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1838 thread Xorg:cs0 pid 1862

Currently the system is running okay as I haven't opened Nemo yet (which can almost 100% cause the freeze). Web browsers such as Firefox and Chrome currently don't cause the freeze.
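
On the question of getting the kernel log onto disk before a freeze, here is a minimal sketch. It assumes `dmesg --follow` (util-linux) is available and that the script has permission to read the kernel log; the file name `amdgpu-hang.log` and the helper names are hypothetical. Matching lines are appended and synced to disk one by one, so they have a better chance of surviving a hard reset.

```python
import os
import subprocess

# Substrings taken from the messages quoted in this thread.
PATTERNS = ("amdgpu", "GCVM_L2_PROTECTION_FAULT")


def is_interesting(line):
    """Return True for kernel log lines worth keeping."""
    return any(p in line for p in PATTERNS)


def capture(out_path="amdgpu-hang.log", cmd=("dmesg", "--follow")):
    """Follow the kernel log and append matching lines to out_path.

    Each line is flushed and fsync'ed so that it reaches the disk
    before a possible hard reset. Runs until interrupted (Ctrl+C).
    """
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc, \
            open(out_path, "a") as out:
        for line in proc.stdout:
            if is_interesting(line):
                out.write(line)
                out.flush()
                os.fsync(out.fileno())
```

Calling `capture()` in a terminal before reproducing the hang should leave the last amdgpu messages in the file. If the journal is persistent, `journalctl -k -b -1` after the reboot shows the kernel messages of the previous boot as well.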
Comment 155 jmsharvey771 2019-10-24 19:00:12 UTC
Some observations from me that may point to this being an OpenGL issue:

* Vulkan applications seem to work (mostly). I've not had a crash with the Dolphin Emulator with the Vulkan backend and Heat Signature, a game that runs through Proton. This doesn't explain Rise of the Tomb Raider though. I've also had freezing issues with Overwatch via Lutris/DXVK.

* Running freezing games in windowed mode stops hangs. In CS:GO, Minecraft, and Team Fortress 2, my system freezes in the menus. When I run them in windowed mode, they seem to run fine

* OpenGL games freeze after mouse input (for example, selecting a menu item). This is when CS:GO, TF2, and Minecraft freeze up. 

I am using Manjaro on kernel 5.4-rc4, mesa 19.2.1-2, vulkan-radeon (radv) 19.2.1-2 and xf86-video-amdgpu 19.1.0-1. I use KDE Plasma 5.17.1
Comment 156 Michael de Lang 2019-10-24 19:12:36 UTC
Just had a hang using 5.4.0-rc3, mesa 19.3~git1910171930.4b458b~oibaf~d, AMD_DEBUG="nodma nongg" while using firefox:

Oct 24 16:31:26 oipo-X570-AORUS-ELITE kernel: [27386.467009] broken atomic modeset userspace detected, disabling atomic
Oct 24 21:04:58 oipo-X570-AORUS-ELITE kernel: [43796.470041] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Oct 24 21:04:58 oipo-X570-AORUS-ELITE kernel: [43798.773602] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=1756792, emitted seq=1756794
Oct 24 21:04:58 oipo-X570-AORUS-ELITE kernel: [43798.773683] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GPU Process pid 17048 thread firefox:cs0 pid 17134
Oct 24 21:04:58 oipo-X570-AORUS-ELITE kernel: [43798.773685] [drm] GPU recovery disabled.
Comment 157 Marko Popovic 2019-10-24 19:15:42 UTC
(In reply to Michael de Lang from comment #156)
> Just had a hang using 5.4.0-rc3, mesa 19.3~git1910171930.4b458b~oibaf~d,
> AMD_DEBUG="nodma nongg" while using firefox:
> 
> Oct 24 16:31:26 oipo-X570-AORUS-ELITE kernel: [27386.467009] broken atomic
> modeset userspace detected, disabling atomic
> Oct 24 21:04:58 oipo-X570-AORUS-ELITE kernel: [43796.470041]
> [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed
> out!
> Oct 24 21:04:58 oipo-X570-AORUS-ELITE kernel: [43798.773602]
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled
> seq=1756792, emitted seq=1756794
> Oct 24 21:04:58 oipo-X570-AORUS-ELITE kernel: [43798.773683]
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process GPU
> Process pid 17048 thread firefox:cs0 pid 17134
> Oct 24 21:04:58 oipo-X570-AORUS-ELITE kernel: [43798.773685] [drm] GPU
> recovery disabled.

OK, this doesn't sound right: how can you get an SDMA hang if you disable DMA completely? The command should be:
AMD_DEBUG=nodma
not
AMD_DEBUG="nodma"
Comment 158 Konstantin Pereiaslov 2019-10-24 19:33:28 UTC
Also experiencing this with Radeon RX 5700 XT and amdgpu  19.1.0+git1910111930.b467d2~oibaf~b with kernel version 5.3.7-050307-generic running KDE Neon User edition with latest updates.

Didn't have any heavy load for the GPU to do.

First, some artifacts appeared on the Plasma Hard Disk Monitor widget and the CPU Load widget (here is a screenshot: https://i.perk11.info/20191024_193152_kernel.png) while the PC was idle and the screen was locked, but everything else continued to work fine.

I checked the logs for the period when this could've happened, but the only logs from that period are from KScreen that start like this:

Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputProperty (ignored)
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Output:  88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Property:  EDID
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         State (newValue, Deleted):  1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputProperty (ignored)
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Output:  88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Property:  EDID
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         State (newValue, Deleted):  1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputChange
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Output:  88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         CRTC:  81
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Mode:  97
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Rotation:  "Rotate_0"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Connection:  "Disconnected"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Subpixel Order:  0
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRScreenChangeNotify
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Window: 18874373
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Root: 1744
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Rotation:  "Rotate_0"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Size ID: 65535
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Size:  7280 1440
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         SizeMM:  1926 381
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper: RRNotify_OutputChange
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Output:  88
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         CRTC:  81
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Mode:  97
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Rotation:  "Rotate_0"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Connection:  "Disconnected"
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xcb.helper:         Subpixel Order:  0
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xrandr: XRandROutput 88 update
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          m_connected: 0
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          m_crtc XRandRCrtc(0x5655577da9f0)
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          CRTC: 81
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          MODE: 97
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          Connection: 1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          Primary: false
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xrandr: Output 88 : connected = false , enabled = true
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]: kscreen.xrandr: XRandROutput 88 update
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          m_connected: 1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          m_crtc XRandRCrtc(0x5655577da9f0)
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          CRTC: 81
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          MODE: 97
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          Connection: 1
Oct 24 16:34:58 perk11-home org.kde.KScreen[25804]:          Primary: false



90 minutes later, the system became unresponsive while I was typing a message in Skype, but the audio I had playing in Audacity continued to play and the cron jobs continued running normally for a few minutes while I was trying to get the system unstuck without rebooting it, which I couldn't do.

Here are the errors:

Oct 24 19:04:10 perk11-home kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Oct 24 19:04:10 perk11-home kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Oct 24 19:04:15 perk11-home kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Oct 24 19:04:15 perk11-home kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=3485981, emitted seq=3485983
Oct 24 19:04:15 perk11-home kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2469 thread Xorg:cs0 pid 2491
Oct 24 19:04:15 perk11-home kernel: [drm] GPU recovery disabled.
Comment 159 Michael de Lang 2019-10-24 19:37:26 UTC
Thank you for making me look twice at the contents of the variable. The env variable was indeed incorrect, but not because of the quotes, which don't do anything to the contents of the variable. Rather, the error is that the values must be comma-separated, not space-separated. For posterity, this means that I will now be running with AMD_DEBUG="nodma,nongg".

Commenters #150 and #151 should also look into this.
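
To illustrate the question from comment #142 (two lines or one?), here is a small sketch. It rests on two assumptions that are not verified here: that a later assignment of the same variable replaces an earlier one, as it does in a shell, and that the value is split on commas, as concluded above. The function names are hypothetical, and the real reader of /etc/environment on a given distro may behave differently.

```python
def effective_value(text, key="AMD_DEBUG"):
    """Return the last value assigned to key in KEY=VALUE text, or None."""
    value = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Tolerate shell-style "export KEY=VALUE" lines.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        name, sep, rest = line.partition("=")
        if sep and name.strip() == key:
            # Surrounding quotes are not part of the value.
            value = rest.strip().strip("\"'")
    return value


def flags(value):
    """Split a comma-separated debug value into individual flags."""
    return [f for f in (value or "").split(",") if f]
```

Under these assumptions, two separate AMD_DEBUG lines leave only the second flag active, while a single line with `nodma,nongg` yields both flags.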
Comment 160 L.S.S. 2019-10-25 01:18:08 UTC
I'll try AMD_DEBUG="nodma,nongg" when I get back.

Is this issue mostly caused by the amdgpu driver itself, or by mesa? It seems more related to the driver, as I have the same system freeze issue with both mesa from the official Manjaro repo and mesa-aco-git from the AUR (which is a bit newer).

Speaking of rendering, how do current web browsers render images/videos nowadays? I haven't gotten a single system freeze that was caused directly by a web browser (Firefox/Chromium) yet, so I'm curious, given the issue might be OpenGL-related.

So far the "ring sdma0 timeout" errors have been mostly caused by Nemo. Opening the file manager, browsing files, or simply leaving the file manager running can all cause the system to freeze at some point later.

By the way (off-topic), how's the issue on Wayland? Does Cinnamon have proper support for Wayland, and does anyone on Manjaro have experience with switching from Xorg to Wayland? I'm still unfamiliar with Wayland as I have never really used it (all the DEs I've been actively using, such as XFCE and Cinnamon, are still on X11/Xorg).
Comment 161 bugs 2019-10-25 03:26:18 UTC
> Regarding this issue, is this issue mostly caused by the amdgpu driver
> itself, or caused by mesa?

I tried to avoid freezes and uninstalled amdgpu, just using the modesetting driver for X11 - and got freezes. So I don't think it's a problem of amdgpu.

Maybe somebody could confirm this?
Comment 162 Shmerl 2019-10-25 03:28:05 UTC
I suppose it's a problem with radeonsi specifically. Hopefully AMD developers can clarify this.
Comment 163 L.S.S. 2019-10-25 13:16:30 UTC
Created attachment 145814 [details]
Newly captured GCVM_L2_PROTECTION_FAULT errors. This was captured on 5.4(rc) kernel, and with AMD_DEBUG=nodma.

I got a few more freezes when using Nemo. This time with AMD_DEBUG=nodma or AMD_DEBUG="nodma,nongg".

I put AMD_DEBUG in /etc/environment, and I can indeed confirm it is set from a terminal (echo $AMD_DEBUG). It seems this doesn't work, as the freezes I got this time are also sdma0 type, same as before.

I also captured some new GCVM_L2_PROTECTION_FAULT errors. Not sure if they're different from last time. This is captured on 5.4(rc) kernel with AMD_DEBUG=nodma.

In the end, the sdma0 error doesn't seem to go away and I'm not even sure whether the parameter was set correctly. Where am I supposed to put the AMD_DEBUG parameters on Manjaro?
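
One way to check whether the setting reached the process that actually hangs: `echo $AMD_DEBUG` only shows the terminal's environment, while the timeouts in this thread name Xorg as the process. On Linux the environment a process was started with can be read from /proc/<pid>/environ (usually as root for another user's process). A minimal sketch, with hypothetical helper names; whether Xorg inherits /etc/environment depends on how the display manager starts it, which is not verified here.

```python
def parse_environ(blob):
    """Parse the NUL-separated contents of /proc/<pid>/environ."""
    env = {}
    for entry in blob.split(b"\0"):
        name, sep, value = entry.partition(b"=")
        if sep:
            env[name.decode(errors="replace")] = value.decode(errors="replace")
    return env


def process_env(pid):
    """Read the environment a process was started with (may need root)."""
    with open(f"/proc/{pid}/environ", "rb") as fh:
        return parse_environ(fh.read())
```

For example, `process_env(1838).get("AMD_DEBUG")` would report the value seen by the Xorg process from comment #154 (use the current pid from `pidof Xorg` or similar); `None` means the variable never reached it.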
Comment 164 L.S.S. 2019-10-25 13:27:20 UTC
EDIT: Did some analysis myself about the GCVM_L2_PROTECTION_FAULT errors...

The errors last time contained this:

src_id:0 ring:40 vmid:7 pasid:32769
GCVM_L2_PROTECTION_FAULT_STATUS:0x00741A51 (only on first error)

Whereas the errors this time contained this:

src_id:0 ring:40 vmid:1 pasid:32769
GCVM_L2_PROTECTION_FAULT_STATUS:0x00141A51 (only on first error)

vmid became 1 and GCVM_L2_PROTECTION_FAULT_STATUS changed from 0x00741A51 to 0x00141A51. The rest of the first error remained the same.

MORE_FAULTS: 0x1
WALKER_ERROR: 0x0
PERMISSION_FAULTS: 0x5
MAPPING_ERROR: 0x0
RW: 0x1

In subsequent errors those values were all 0.

Both times the first error has a starting address of 0x00000318c00e7000.

Not sure if these could be of any help, though.
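
The comparison above can be sketched in code. The bit positions below are an assumption (they are not stated anywhere in this thread), chosen because they reproduce the decoded values listed above for both status words; treat them as illustrative, not as the authoritative register layout.

```python
# Assumed (shift, width) of each field; not taken from this thread.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode(status):
    """Split a fault status word into the assumed bit fields."""
    return {name: (status >> shift) & ((1 << width) - 1)
            for name, (shift, width) in FIELDS.items()}


def changed(a, b):
    """Return the names of fields whose values differ between a and b."""
    da, db = decode(a), decode(b)
    return [name for name in FIELDS if da[name] != db[name]]
```

With this layout, `changed(0x00741A51, 0x00141A51)` reports only the VMID field (7 versus 1), which would mean the two status words describe the same kind of fault and differ only in the vmid that is also printed on the preceding line.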
Comment 165 Shmerl 2019-10-25 14:49:53 UTC
(In reply to L.S.S. from comment #163)
> This was captured on 5.4(rc)

Just to clarify, do you have all the mentioned patches above applied? 5.4-rc4 already includes the mask patch, but not the other two.
Comment 166 Marko Popovic 2019-10-25 15:00:35 UTC
(In reply to Shmerl from comment #165)
> (In reply to L.S.S. from comment #163)
> > This was captured on 5.4(rc)
> 
> Just to clarify, do you have all the mentioned patches above applied?
> 5.4-rc4 already includes the mask patch, but not the other two.

Are you sure about that? I'm using 5.4 daily and I still get frequent freezes, which didn't happen even remotely as often with mask patch applied... when has it been accepted upstream?
Comment 167 Shmerl 2019-10-25 15:07:17 UTC
(In reply to Marko Popovic from comment #166)
> when has it been accepted upstream?

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7557d2783850eec199cae78dac561e9b7de181be
Comment 168 L.S.S. 2019-10-25 15:34:09 UTC
For the 5.4 kernel, I'm running 5.4-rc2 (from official Manjaro repo). Not sure when Manjaro Stable will receive its next update regarding kernels...
Comment 169 Marko Popovic 2019-10-25 15:35:33 UTC
(In reply to L.S.S. from comment #168)
> For the 5.4 kernel, I'm running 5.4-rc2 (from official Manjaro repo). Not
> sure when Manjaro Stable will receive its next update regarding kernels...

You can always compile Kernel-git but Manjaro should be decently fast to provide 5.4+ RC series.
Comment 170 Marko Popovic 2019-10-25 15:36:43 UTC
By the way if anyone is up for it, we can make a dedicated Discord chat room for Navi linux users, so we don't bloat this bugtracker, since a lot of the comments are just random questions etc. Let me know what you think
Comment 171 Shmerl 2019-10-25 15:42:38 UTC
(In reply to Marko Popovic from comment #170)
> By the way if anyone is up for it, we can make a dedicated Discord chat room
> for Navi linux users, so we don't bloat this bugtracker, since a lot of the
> comments are just random questions etc. Let me know what you think

I'd prefer something on Matrix (FOSS and open protocol after all). Not really using Discord.
Comment 172 Marko Popovic 2019-10-25 15:43:23 UTC
(In reply to Shmerl from comment #171)
> (In reply to Marko Popovic from comment #170)
> > By the way if anyone is up for it, we can make a dedicated Discord chat room
> > for Navi linux users, so we don't bloat this bugtracker, since a lot of the
> > comments are just random questions etc. Let me know what you think
> 
> I'd prefer something on Matrix (FOSS and open protocol after all). Not
> really using Discord.

Sure, I'm up for that!
Comment 173 Marko Popovic 2019-10-25 15:57:27 UTC
https://matrix.to/#/!UiDmeMlfsLndmzmPhp:matrix.org?via=matrix.org

Here is a link to Matrix community, anyone interested should try to join.
Comment 174 Shmerl 2019-10-25 16:03:06 UTC
Is it public? I can't join the room.
Comment 175 Marko Popovic 2019-10-25 16:06:18 UTC
(In reply to Shmerl from comment #174)
> Is it public? I can't join the room.

https://matrix.to/#/!XvwReLqAqwRmEzgmVh:matrix.org?via=matrix.org

Sorry, this should work
Comment 176 L.S.S. 2019-10-26 06:03:15 UTC
Unfortunately this still happens with Nemo on 5.4-rc4 kernel (official), after switching to Manjaro Testing channel.

The same ring sdma0 timeout error appears. An interesting phenomenon is that when the screen freezes (taskbar clock stopped changing), at first the mouse can still move, but after a few clicks the mouse stopped moving and the screen appears to have shifted to a previous frame before freezing completely:

The contents of the previous folder would reappear in Nemo, and the taskbar clock may sometimes move a second backwards.

I've removed AMD_DEBUG=nodma since it apparently doesn't work. If the patches are meant for 5.4-rc4, which patches are needed to address this problem?

For now I'm using nnn (a terminal-based file manager) for browsing files since terminals don't freeze the system... I'm not sure what might be triggering the freeze as all the lockups I have so far all happened when using Nemo. Other programs (including Firefox and Chromium) haven't triggered the freeze yet.
Comment 177 L.S.S. 2019-10-27 02:44:11 UTC
I'm still getting freezes when using Nemo with the same sdma0 timeout, on latest Manjaro 5.4 rc4 kernel built from latest PKGBUILD (which included the sdma0 fix commits) and after applying the sdma_read_delay patch.

Additionally, I discovered that changing system icon themes on Cinnamon can also trigger the freeze. Error codes are the same (ring sdma0 timeout).

Additionally, before this, last night I was able to generate an sdma1 error when browsing with Chromium. This time it names chromium instead of Xorg as the process that caused the ring timeout:

kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=2140606, emitted seq=2140608
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chromium pid 39450 thread chromium:cs0 pid 39509

It seems that in all occurrences, the difference between the emitted and signaled values is always 2.
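
To check that observation over a whole log rather than by eye, a minimal sketch follows; the function name is hypothetical and the pattern is derived only from the timeout lines quoted in this thread.

```python
import re

TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_timeouts(log_text):
    """Yield (ring, signaled, emitted, emitted - signaled) per timeout line."""
    for m in TIMEOUT_RE.finditer(log_text):
        signaled = int(m["signaled"])
        emitted = int(m["emitted"])
        yield m["ring"], signaled, emitted, emitted - signaled
```

Feeding it the output of `journalctl -k` or a saved dmesg log gives one tuple per timeout, which also makes it easy to separate sdma0/sdma1 hangs from other ring types.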

Is there any progress regarding this issue? Or is there any more information needed (and how do I enable verbose logs in the system regarding amdgpu and related parts)?
Comment 178 L.S.S. 2019-10-27 03:10:09 UTC
Created attachment 145827 [details]
Errors captured with amdgpu.gpu_recovery=1

It seems GPU recovery is not yet ready for Navi. Just attempted to turn on that feature and when the freeze occurs, the screen turned black for a few seconds then returned and stayed frozen.

The journalctl log said the GPU recovery failed, followed by snd_hda_intel spamming errors and eventually crashing (which I think might be due to the HDMI/DP audio codec losing communication with the video card).
Comment 179 Shmerl 2019-10-28 21:43:53 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #33)
> Created attachment 145323 [details] [review] [review]
> wip patch
> 
> You can give a try to the attached kernel patch which hopefully could
> prevent some sdma timeouts.
> 
> I'm still testing it but the more testers the better :)

From the three patches, the mask patch is already upstreamed. Do you plan to upstream the other two for 5.4 cycle?
Comment 180 L.S.S. 2019-10-29 12:17:30 UTC
The other two patches do not fix the problem for me (sdma read delay and the wip patch). After applying these two patches (along with the mask patch which was already included upstream), I still get the same ring sdma0 timeout (process Xorg) freezes when using Nemo.
Comment 181 Daniel Suarez 2019-10-29 17:19:20 UTC
(In reply to L.S.S. from comment #180)
> The other two patches do not fix the problem for me (sdma read delay and the
> wip patch). After applying these two patches (along with the mask patch
> which was already included upstream), I still get the same ring sdma0
> timeout (process Xorg) freezes when using Nemo.

Don't feel left out. Those patches don't seem to work for almost anyone. At best they help in some specific scenarios, but they really aren't a proper solution/fix.
Comment 182 Shmerl 2019-10-29 17:30:15 UTC
Yep, even with 5.4-rc5 with those two extra patches applied, Firefox hangs randomly sometimes.
Comment 183 Timur Kristóf 2019-10-31 05:14:25 UTC
(In reply to Jaap Buurman from comment #142)
> How can I set both AMD_DEBUG=nongg and AMD_DEBUG=nodma in the
> /etc/environment file? Do they need to be on two separate lines, or will the
> second line simply overwrite the first one by setting the same environment
> variable? Do they need to be comma separated maybe?

Add the following line to your /etc/environment

export AMD_DEBUG=nongg,nodma

(In reply to Pierre-Eric Pelloux-Prayer from comment #141)
> I don't think radv uses SDMA at all, so they cannot be affected by this
> issue. 

Correct, radv doesn't use the SDMA so is not affected by this problem. If you see hangs in Vulkan games, it is currently most likely an LLVM problem. The LLVM devs have fixed most of the problems in their latest master, but haven't backported the fixes to LLVM 9 yet.
Comment 184 wychuchol 2019-10-31 11:54:22 UTC
Pop OS 19.10, latest Oibaf mesa (as of date) not sure what llvm, I'm kinda new at this and search "how to check my llvm version" didn't yield any results... Please be patient with me.

Anyway this happens frequently on rx 5700 xt, DDLC Monika's After Story mod (similar things occur with youtube videos, browsing internet - like trying to log in or opening a 

Oct 31 11:52:34 pop-os kernel: [  129.130712] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Oct 31 11:52:34 pop-os kernel: [  133.994710] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=17012, emitted seq=17014
Oct 31 11:52:34 pop-os kernel: [  133.994747] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process DDLC pid 3150 thread DDLC:cs0 pid 3168
Oct 31 11:52:34 pop-os kernel: [  133.994748] [drm] GPU recovery disabled.

Sometimes it's right away, sometimes it can run for maybe an hour or so, but it does hang: everything besides the mouse pointer stops (and I can't click on anything), I can't switch to a system terminal via Ctrl+Alt+F3, and the power button does not give a signal to shut down (I tried waiting for about 2 minutes; maybe I needed to wait more). Nothing really helps, and REISUB doesn't seem to work at all here (or I'm doing it wrong), so the only option left is a hard reset.
Comment 185 wychuchol 2019-10-31 12:00:10 UTC
I wrote a nice long post but for some reason my browser decided to refresh so it got dunked...

Anyways long story short:

RX 5700 XT, Pop OS 19.10, latest Oibaf mesa, I don't know how to check llvm version cause search engine gave me no answer but it's probably whatever got installed using this guide and updated: 
https://ubuntuforums.org/showthread.php?t=2425799

DDLC with Monika's After Story mod running natively
Oct 31 11:52:34 pop-os kernel: [  129.130712] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Oct 31 11:52:34 pop-os kernel: [  133.994710] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=17012, emitted seq=17014
Oct 31 11:52:34 pop-os kernel: [  133.994747] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process DDLC pid 3150 thread DDLC:cs0 pid 3168
Oct 31 11:52:34 pop-os kernel: [  133.994748] [drm] GPU recovery disabled.
Comment 186 wychuchol 2019-10-31 12:11:16 UTC
Forgot to add, Kernel v5.4-rc5.
Sorry for doublepost, if someone feels the need to delete that second message please do, I can't find a way to delete my own posts.
Comment 187 Konstantin Pereiaslov 2019-10-31 19:17:57 UTC
As recommended here, I added AMD_DEBUG="nongg,nodma" to /etc/environment and additionally added export AMD_DEBUG="nongg,nodma" to ~/.profile just to be sure. In the 5 days since, I have had only one system freeze, and it had a different journalctl message, so it did help me with the sdma0 timeout issue!
Comment 188 L.S.S. 2019-11-01 14:27:10 UTC
Not sure where the problem might be.

After installing 5.4-rc5, in addition to amdgpu-pro-libgl (and other amdgpu-pro related stuff), I stopped encountering those dreaded "ring sdma0 timeout" freezes when using Nemo. I think the amdgpu-pro stuff might be what "fixed" it.

I'll test this for the time being. I cannot be confident that it would be completely fixed this way, but at least the situation has been improved to the point that Nemo is now usable again.
Comment 189 wychuchol 2019-11-01 16:29:04 UTC
(In reply to Konstantin Pereiaslov from comment #187)
> As recommended here I added AMD_DEBUG="nongg,nodma" to /etc/environment and
> additionally added export AMD_DEBUG="nongg,nodma" to ~/.profile just to be
> sure and for 5 days since that I only had one system freeze and it had a
> different journalctl message, so it did help me help with sdma0 timeout
> issue!

Thank you very much. I was afraid to try this since someone mentioned performance drops but I haven't noticed any in applications I use.
Comment 190 wychuchol 2019-11-01 19:20:06 UTC
Added AMD_DEBUG="nongg,nodma" to /etc/environment, but the hang still happened while opening a webm file in a new tab in Palemoon.
Nov  1 20:10:30 pop-os kernel: [24044.197839] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
Nov  1 20:10:30 pop-os kernel: [24049.317800] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=3673639, emitted seq=3673641
Nov  1 20:10:30 pop-os kernel: [24049.317836] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2350 thread Xorg:cs0 pid 2351
Nov  1 20:10:30 pop-os kernel: [24049.317838] [drm] GPU recovery disabled.

I'd think it happens less though.
Comment 191 wychuchol 2019-11-01 19:21:05 UTC
Oh and music player kept working, played next track from playlist and I managed to reset with REISUB.
Comment 192 Seba Pe 2019-11-01 20:16:04 UTC
(In reply to L.S.S. from comment #188)
> Not sure where the problem might be.
> 
> After installing 5.4-rc5, in addition to amdgpu-pro-libgl (and other
> amdgpu-pro related stuffs), I stopped encountering those dreaded "ring sdma0
> timeout" freezes when using Nemo. I think amdgpu-pro stuffs might be what
> "fixed" it.
> 
> I'll test this for the time being. I cannot be confident that it would be
> completely fixed this way, but at least the situation has been improved to
> the point that Nemo is now usable again.

As I said in comment #149 (https://bugs.freedesktop.org/show_bug.cgi?id=111481#c149), amdgpu-pro does not exhibit freezes or timeouts.

This appears to point to a problem in the generated instructions from libgl (or potentially a combination of that plus an underlying issue in the kernel driver).
Comment 193 wychuchol 2019-11-02 23:11:39 UTC
Perhaps this needs a separate bug entry, but it's related (since it didn't happen before I tried RADV_PERFTEST=aco and AMD_DEBUG="nongg,nodma"), so I'll post it in case someone has had the same issues as me.

After some time in Witcher 3 GOTY (run with Lutris), the PC restarts on its own. I thought something was overheating (I've noticed graphics card memory in PSensor sometimes reaching 90, so I thought maybe that's what was happening), but I investigated kern.log and this always appeared before the autonomous reset:

Nov  2 22:01:53 pop-os kernel: [  979.244964] pcieport 0000:00:01.1: AER: Corrected error received: 0000:01:00.0
Nov  2 22:01:53 pop-os kernel: [  979.244967] nvme 0000:01:00.0: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
Nov  2 22:01:53 pop-os kernel: [  979.244968] nvme 0000:01:00.0: AER:   device [1987:5012] error status/mask=00001000/00006000
Nov  2 22:01:53 pop-os kernel: [  979.244968] nvme 0000:01:00.0: AER:    [12] Timeout               
Nov  2 22:01:53 pop-os kernel: [  979.262629] Emergency Sync complete

A solution I found is to add pci=nommconf in /etc/default/grub to the line 
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash" (so it looks like this: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=nommconf").
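
The edit described above can be sketched as a small text transformation. This is only an illustration with a hypothetical function name: it works on the file contents as a string, does not write /etc/default/grub itself, and the bootloader configuration still has to be regenerated afterwards in whatever way the distro provides.

```python
import re


def add_kernel_param(grub_text, param, key="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_text with param appended to key's quoted value, once."""
    pattern = re.compile(r'^(%s=")([^"]*)(")' % re.escape(key), re.MULTILINE)

    def repl(match):
        params = match.group(2).split()
        # Do not add the parameter a second time.
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return pattern.sub(repl, grub_text)
```

Applied to the line quoted above with `param="pci=nommconf"`, it produces the same result as the manual edit; running it again leaves the line unchanged.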
Comment 194 Shmerl 2019-11-03 00:40:10 UTC
It sounds like NVMe problem, so not related to amdgpu?
Comment 195 L.S.S. 2019-11-03 01:26:50 UTC
It's possible that the GPU issues affect other devices on the PCIe bus. With the Radeon RX 5700 XT I'm also encountering some NVMe-related errors, but I don't think my NVMe drives are at fault, as they worked just fine before I installed this video card.

I recall that if I don't power cycle the PC (not just press the reset button) when the freeze happens, one of my non-system NVMe drives reports "frozen state error detected, reset controller" errors (the system attempts to reset its controller, and it may still work), and some other NVMe drives may not be detected by the system at all until I do a power cycle (a quick one is enough).
Comment 196 wychuchol 2019-11-03 04:16:39 UTC
(In reply to Shmerl from comment #194)
> It sounds like NVMe problem, so not related to amdgpu?

The thing is, I played hours upon hours of Witcher 3 without any hangs or autonomous resets until I added lines to /etc/environment. Changing settings to make amdgpu more stable caused the conflict, so I'd propose it is related.
Comment 197 wychuchol 2019-11-04 16:12:30 UTC
Despite the 'fix' I posted in comment 193 AER PCI bus errors still happen, and autonomous resets happen as well. I think it's less frequent though. Still it's difficult to say for sure or put in a precise value.
Comment 198 Shmerl 2019-11-04 16:14:17 UTC
(In reply to wychuchol from comment #197)
> Despite the 'fix' I posted in comment 193 AER PCI bus errors still happen,
> and autonomous resets happen as well. I think it's less frequent though.
> Still it's difficult to say for sure or put in a precise value.

Could be a motherboard issue with PCIe 4.
Comment 199 Marko Popovic 2019-11-04 20:21:07 UTC
Created attachment 145882 [details]
Trace file from Blender SDMA hang

Here is a trace file of the SDMA hang provoked by using Blender. It happens pretty much every time in the same place, so I guess those hangs look random on the surface but are in fact reproducible.
Comment 200 Marko Popovic 2019-11-04 20:37:41 UTC
(In reply to Marko Popovic from comment #199)
> Created attachment 145882 [details]
> Trace file from Blender SDMA hang
> 
> Here is a trace file of the SDMA hang provoked by using blender, happens
> pretty much all the time on the same place, so I guess those hangs look
> random on the surface but are reproducible indeed.

+ Extra info: it doesn't happen with nodma on... so it's definitely SDMA related, not shaders...
Comment 201 Daniel Suarez 2019-11-04 20:44:48 UTC
AMD has been pretty quiet here lately, has anyone tested with the 6th release candidate for kernel 5.4? AMD was present in the changelogs and they did some SDMA improvements, some mentioning that it fixes some freezes
Comment 202 Marko Popovic 2019-11-04 20:46:04 UTC
(In reply to Daniel Suarez from comment #201)
> AMD has been pretty quiet here lately, has anyone tested with the 6th
> release candidate for kernel 5.4? AMD was present in the changelogs and they
> did some SDMA improvements, some mentioning that it fixes some freezes

The last trace I posted was captured on 5.4 RC6 and Mesa git...
Comment 203 Shmerl 2019-11-04 20:47:06 UTC
I'm running 5.4-rc6. No more hangs in Firefox at least, but that also could be due to me switching to Firefox nightly (stock, not the custom one I was testing before).
Comment 204 Daniel Suarez 2019-11-04 21:16:51 UTC
(In reply to Marko Popovic from comment #202)
> (In reply to Daniel Suarez from comment #201)
> > AMD has been pretty quiet here lately, has anyone tested with the 6th
> > release candidate for kernel 5.4? AMD was present in the changelogs and they
> > did some SDMA improvements, some mentioning that it fixes some freezes
> 
> Last trace that I posted is done on 5.4 RC6 and MESA git...

My bad I missed that. 

Shame, AMD really needs to get it together
Comment 205 wychuchol 2019-11-04 22:25:27 UTC
(In reply to Shmerl from comment #198)
> (In reply to wychuchol from comment #197)
> > Despite the 'fix' I posted in comment 193 AER PCI bus errors still happen,
> > and autonomous resets happen as well. I think it's less frequent though.
> > Still it's difficult to say for sure or put in a precise value.
> 
> Could be a motherboard issue with PCIe 4.

Perhaps. I've built this system on Tomahawk B450 MAX but I thought PCIe 4 isn't even enabled by default since it caused problems. How would I go about verifying if something uses PCIe 4?
Hmm there's a new BIOS available it seems, I'm running 7C02v33 and 7C02v34 has some NVMe compatibility updates. I'm gonna try it if I don't see people around internet wailing that it bricked their PCs.
Comment 206 Shmerl 2019-11-05 01:32:44 UTC
(In reply to wychuchol from comment #205)
> How would I go about verifying if something uses PCIe 4?

To avoid lengthy off-topic, I answered in the Matrix room (linked above).
Comment 207 Shmerl 2019-11-05 02:19:40 UTC
And I just got an sdma Firefox hang with 5.4-rc6. So while rare, it still happens.
Comment 208 Lazy 2019-11-05 09:23:09 UTC
I'm adding this note at someone else's recommendation: I'm reproducing similar behavior across Linux distributions and Windows 10. The behavior is as follows:

Linux-Manjaro Linux kernel 5.4.0-1-MANJARO, Mesa 20.0.0-devel (git-dd77bdb34b), and LLVM 10.0.0 (compiled from Git master as I recall):
Boot, launch Overwatch, or SteamVR. Usually after a period of 1-2 hours, displays will stutter a few times, before a full hang, leaving the last rendered frame on each display.

Windows 10: latest insider build as of 11/5/2019:
Similar behavior in the end, aside from the duration of stability being 3-4 hours it seems. Launching SteamVR, I can run for 3-4 hours, and then it stutters, hangs for a few seconds, then recovers. Then it'll do the same a few moments later with a longer duration before the recovery. After this repeats a few times, the display either hangs on the last frame, or all displays go black. After this, I have to hard-shutdown the same way as I do for Manjaro. 

This may not be the exact same behavior, but I don't know of a way to log this particular behavior in Windows.
Comment 209 L.S.S. 2019-11-06 00:43:19 UTC
Really?!

Although I haven't really used the card under Windows, if similar behavior happens on Windows as well then something's really, really wrong here.

I haven't tested gaming on Manjaro yet, but at least with the amdgpu-pro stack on Manjaro the sdma0 freezes with Nemo stopped happening.

On the other hand, video card recovery is not yet mature on Linux, whereas on Windows it has long been available thanks to WDDM. You cannot completely rely on it, though: some apps can still misbehave if the driver has crashed at least once during the system's uptime, and it may eventually fail to recover at some point later on.

Which brand of the Radeon RX 5700/XT are you using? For me I'm using a 50th Anniversary edition. How's the thermal condition when you play games on the card? It's possible the card might have weird behavior if it's under load with temperature near triple digits (something that I personally would never allow).

I have a PCI slot fan set (consists of 3 slim fans which is around the same length as the card itself) placed beneath the card, blowing upwards, and it seems very effective. With the help of its own blower fans, the card maintains a steady 50 celsius under load.
Comment 210 Lazy 2019-11-06 08:38:20 UTC
To clarify, first: it's an Asus reference (blower-style) 5700XT
I can't use the overclock utilities without a crash coming within the hour on Windows 10 or any of my Linux installs, no fan profiles, no manual control of fans, no setting it to "high performance" on the dynamic clock or it crashes within the hour. No exceptions, no setting then resetting the setting to default to get around it.

Generally speaking, it maxes around 75C, but that's mostly due to the default fan profile only ramping up enough to negate further gains at that point (I'm guessing that's to do with trying to keep the card quiet). If I supply cool air, it'll slow the fan, and the heat still comes eventually.

Some things that may or may not be relevant:

This card crashes mostly around times that the clock rate adjusts more often; If the card goes from, say, max freq to a step below and back, there's a chance of a crash. (maybe coincidence, maybe not, I don't know to be bluntly honest)
This is a constant I've noticed on both OSes. Windows 10 tends to keep things relatively stable in that regard, while Manjaro tends to see a lot of spiking and sudden drops. SteamVR definitely instigates that kind of behavior in my experience on my old Vega 56 as well (Which with nodma set on Navi, is actually not much different tbh). Probably explains why ever since the latest set of patches, the majority of the time it crashes is after an hour or two of gameplay in Manjaro. (also no idea why Manjaro switches more often..)

To be blunt, though, in both OSes, seemingly random hangs are also a common occurrence for me. I had Win10 just yesterday, hang completely, no recovery, simply animating a minimizing window as SteamVR first opened. Granted, this also coincided with a rapid up-tick in clock speed most likely, as I've observed this massive spike on launching SteamVR via GPU adjusting utilities before I realized they instigate the issue as well.

Setting nodma does get rid of some of the more random crashes, but these ones stick around in my experience so far. Maybe 75C is a bit high, but in neither OS can I manage to adjust the fans without the same issue, so.. No idea what to do, here.
Comment 211 Marco Liedtke 2019-11-06 09:40:38 UTC
Hi folks,

I am new to bug reporting, but since I have a new system and this bug, I want to contribute something.

I see almost the same behavior as stated in comment 1.

My Xorg freezes every session no matter what I do. I could only get the last dmesg over ssh before I had to hard reboot, because ssh was the only thing still working.

2 Examples:
[ 1184.577790] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 1189.697729] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=53043, emitted seq=53045
[ 1189.697797] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1398 thread Xorg:cs0 pid 1409
[ 1189.697799] [drm] GPU recovery disabled.


[ 708.286318] [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[ 713.406528] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=104848, emitted seq=104850
[ 713.406594] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 1402 thread Xorg:cs0 pid 1414
[ 713.406596] [drm] GPU recovery disabled.
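
When comparing several of these dmesg excerpts, it can help to pull out the ring name and the two sequence numbers automatically. The following is a rough sketch that assumes the exact message wording shown above; the function name and the "pending" label are invented for this example, and reading emitted minus signaled as "jobs still outstanding" is an interpretation, not something the driver output states.

```python
import re

# Matches the "ring <name> timeout" lines exactly as they appear in the logs above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeouts(log_text):
    """Return (ring, signaled, emitted, pending) for every ring timeout line."""
    results = []
    for m in TIMEOUT_RE.finditer(log_text):
        signaled = int(m.group("signaled"))
        emitted = int(m.group("emitted"))
        # "pending" is our own label for the gap between the two counters.
        results.append((m.group("ring"), signaled, emitted, emitted - signaled))
    return results


sample = """\
[ 1189.697729] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=53043, emitted seq=53045
[ 713.406528] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=104848, emitted seq=104850
"""
for ring, signaled, emitted, pending in parse_ring_timeouts(sample):
    print(f"{ring}: signaled={signaled} emitted={emitted} pending={pending}")
```

Both examples above come out with a gap of 2 on sdma0.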


I have already set AMD_DEBUG=nodam in /etc/environment and in ~/.profile.
Last time I played World of Tanks via Wine and DXVK the same freeze occurred, again with the same error that the Xorg pid timed out...

It happens anywhere from 1 minute to 1 hour after logging in.

My System specs are:
R7 3700x
Powercolor R5700XT Red Dragon Silent Bios enabled
Gigabyte X570 I Aorus Pro WIFI
UBUNTU 18.04.3 LTS with Kernel 5.3.8 and Padoka unstable PPA (Mesa 19.3)

I have no NVME SSD and i have no Monitoring applications running.

Tests done:

- With the standard Ubuntu kernel 4.15 and AMDGPU-PRO installed, everything runs fine without a freeze.
- With kernel 4.18 and Mesa 19.0.8 no freezes occurred, but the kernel does not recognize the RX 5700, so no amdgpu module is loaded.

Freezes occurred with kernels 5.3.7 and 5.3.8, in combination with the padoka and oibaf PPAs (Mesa 19.3).

If I can help with further information, please guide me on how to dig out of my system the info you need.
Comment 212 wychuchol 2019-11-06 13:43:49 UTC
(In reply to Marco Liedtke from comment #211)
> 
> I have already set AMD_DEBUG=nodam in /etc/environment and in ~/.profile.
> Last time i played World of Tanks via Wine and DXVK the same freeze occured,
> again the same error that xorg pid timed out...

Don't know if you made a typo here but do you have AMD_DEBUG="nongg,nodma" line in /etc/environment ? Bugs still occur for me but they're far less frequent.
Also since you're running ryzen 3000 try to get kernel 5.4. It won't solve your problems but there's a massive performance buff for zen2 in 5.4.
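
A misspelled flag such as the "nodam" quoted above is easy to miss by eye. Below is a small sketch of a check for it. It assumes simple KEY=VALUE lines as used in /etc/environment, and the KNOWN_FLAGS set contains only the two flags mentioned in this thread, so it is not a complete list of what Mesa accepts; the function name is made up for this example.

```python
# Only the flags mentioned in this thread; NOT a complete list of Mesa's options.
KNOWN_FLAGS = {"nodma", "nongg"}

def check_amd_debug(env_text):
    """Return the AMD_DEBUG flags in env_text that are not in KNOWN_FLAGS."""
    unknown = []
    for line in env_text.splitlines():
        line = line.strip()
        # Tolerate an "export " prefix, as used in one of the comments here.
        if line.startswith("export "):
            line = line[len("export "):]
        if not line.startswith("AMD_DEBUG="):
            continue
        # Drop optional surrounding quotes, then split the comma-separated list.
        value = line.split("=", 1)[1].strip().strip('"\'')
        for flag in value.split(","):
            flag = flag.strip()
            if flag and flag not in KNOWN_FLAGS:
                unknown.append(flag)
    return unknown


print(check_amd_debug("AMD_DEBUG=nodam\n"))          # ['nodam']
print(check_amd_debug('AMD_DEBUG="nongg,nodma"\n'))  # []
```

An empty result only means that nothing outside this short list was found, not that the setting is actually being applied.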
Comment 213 Marco Liedtke 2019-11-06 19:39:53 UTC
(In reply to wychuchol from comment #212)
> (In reply to Marco Liedtke from comment #211)
> > 
> > I have already set AMD_DEBUG=nodam in /etc/environment and in ~/.profile.
> > Last time i played World of Tanks via Wine and DXVK the same freeze occured,
> > again the same error that xorg pid timed out...
> 
> Don't know if you made a typo here but do you have AMD_DEBUG="nongg,nodma"
> line in /etc/environment ? Bugs still occur for me but they're far less
> frequent.
> Also since you're running ryzen 3000 try to get kernel 5.4. It won't solve
> your problems but there's a massive performance buff for zen2 in 5.4.

Hi, I now have kernel 5.4-rc6 installed and the problem didn't change.
I have written AMD_DEBUG=nodma and NOT AMD_DEBUG="nodma" in /etc/environment.

Now i have added the amdgpu.gpu_recovery=1 attribute in grub.

So now there is a long dmesg output, produced while doing nothing other than clicking "login" in Bugzilla with Firefox.

see attachment dmesg_with_gpu_recovery enabled....

I hope this helps a bit...
Comment 214 Marco Liedtke 2019-11-06 19:41:09 UTC
Created attachment 145904 [details]
dmesg with gpu recovery enabled
Comment 215 Shmerl 2019-11-07 00:41:09 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #141)
> If anyone has a reliable way to trigger the issue, the most helpful thing to
> do for now is an apitrace capture.

Does the trace in comment #199 help to narrow it down?

https://bugs.freedesktop.org/show_bug.cgi?id=111481#c199
Comment 216 lptech1024 2019-11-07 05:12:27 UTC
GPU: PowerColor Red Devil Radeon RX 5700 XT using OC BIOS (default)

Stock Fedora 31: Kernel 5.3.8, GNOME 3.34, Mesa 19.2.2, linux-firmware 20190923, LLVM 9.0.0

I experienced frequent hangs using X.org Gnome (Kernel 5.3.7, > Mesa 19.2.0), especially during graphical file manager operations.

Wayland Gnome is much more stable, although I experienced a hang today after being powered on for almost two hours (45 minutes idle, 75 minutes with high GPU load). Hang occurred during a gaming cutscene.

All messages contained an identical timestamp:

Nov 06 [SNIP] kernel: [drm] GPU recovery disabled.
Nov 06 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ShadowOfTheTomb pid 16893 thread WebViewRenderer pid 16939
Nov 06 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2827901, emitted seq=2827903
Nov 06 [SNIP] kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Comment 217 Shmerl 2019-11-07 05:20:29 UTC
(In reply to lptech1024 from comment #216)
>
>Hang occurred during a gaming cutscene.
> 
>...
> Nov 06 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring
> gfx_0.0.0 timeout, signaled seq=2827901, emitted seq=2827903
> Nov 06 [SNIP] kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR*
> Waiting for fences timed out or interrupted!

If you can reproduce it, please report this to radeonsi bug tracker (and attach an apitrace please).

https://gitlab.freedesktop.org/mesa/mesa/issues

Also, please add details on what game it is (and etc.) here:

https://www.gamingonlinux.com/wiki/Mesa_Broken
Comment 218 L.S.S. 2019-11-07 05:35:24 UTC
It seems the page fault issue has been already reported here. I also found similar page faults in the log sometimes when the lockup occurred (I think it'll definitely show up if I leave the system as is for a prolonged amount of time).

https://gitlab.freedesktop.org/mesa/mesa/issues/2053

I'm not an expert of apitrace, but the reporter provided a trace that would 100% reproduce the lockup, and he was able to bisect the call that caused the lockup which is the last call of that trace file.
Comment 219 Shmerl 2019-11-07 05:37:34 UTC
(In reply to L.S.S. from comment #218)
> 
> I'm not an expert of apitrace, but the reporter provided a trace that would
> 100% reproduce the lockup, and he was able to bisect the call that caused
> the lockup which is the last call of that trace file.

There could be multiple reasons for such hangs, so please just file a separate report if you can reproduce other ones.
Comment 220 Marco Liedtke 2019-11-08 21:57:14 UTC
Created attachment 145917 [details]
dmesg of new sdma0 error while watching youtube with firefox, mainline kernel 5.3.9, padoka ppa mesa 19.3

Hi,

after installing and testing some configurations (amdgpu-pro with amdvlk and kernel 4.15, which works), I went back to radv, because only radv has no graphical issues with World of Tanks (wine + dxvk).
I have another dmesg output. By the way, /etc/environment includes "export AMD_DEBUG=nodma", and with it I can use the PC for 1 or 2 hours, which is much better than before.

So the attachment has a lot of info from the hang, including the sdma0 failure. Maybe this helps...
Comment 221 William Casarin 2019-11-08 23:38:54 UTC
mesa 19.3.0-rc2 + RADV_PERFTEST=aco fixed this for me
Comment 222 Marko Popovic 2019-11-09 12:39:20 UTC
(In reply to William Casarin from comment #221)
> mesa 19.3.0-rc2 + RADV_PERFTEST=aco fixed this for me

ACO should have no impact on SDMA. Firstly OpenGL still uses LLVM, and OpenGL is the only one using SDMA in the first place, radv doesn't. So you must be talking about some different kinds of hangs, probably the ring_gfx types.
Comment 223 lptech1024 2019-11-09 17:57:57 UTC
Followup to #216:

Fedora 31: Kernel 5.3.9, GNOME 3.34, Mesa 19.2.2, linux-firmware 20190923, LLVM 9.0.0

The hang is 100% reproducible.

It occurs running the Linux-native (Vulkan) version of Shadow of the Tomb Raider (SotTR). I have never run SotTR under Proton/Wine, so that isn't a confounding variable.

The (unskippable) cutscene is for the Amazon River in Peru, and the hang occurs anywhere between 15 seconds before the pilot is struck and the moment the pilot is struck. Even when the video hangs, you can usually hear fragments (sound effects) of the game for a few seconds afterwards.

I ran SotTR with vktrace and activated the Gnome (Wayland) overview to see if I could catch any relevant terminal output (none that I saw). The game still had focus, so it continued playing. After the hang (when I rebooted), there wasn't a vktrace file. I assume either it wasn't written out because of the hang or there was no content to write.

However, with it running visible in the overview (and a manual kernel update), I got both ring gfx and sdma errors:

Nov 07 [SNIP]:24 [SNIP] kernel: [drm] GPU recovery disabled.
Nov 07 [SNIP]:24 [SNIP] kernel: [drm] GPU recovery disabled.
Nov 07 [SNIP]:24 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
Nov 07 [SNIP]:24 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1722 thread gnome-shel:cs0 pid 1768
Nov 07 [SNIP]:24 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=1049, emitted seq=1053
Nov 07 [SNIP]:24 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=30017, emitted seq=30020
Nov 07 [SNIP]:19 [SNIP] kernel: [drm] GPU recovery disabled.
Nov 07 [SNIP]:19 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process ShadowOfTheTomb pid 3890 thread WebViewRenderer pid 4981
Nov 07 [SNIP]:19 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=75610, emitted seq=75612
Nov 07 [SNIP]:19 [SNIP] kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!

As a workaround to proceed in the game, I downloaded the AMDVLK 2019.Q4.2 .deb, extracted the contents, modified the JSON file (to point to the local amdvlk64.so), and ran SotTR with the VK_ICD_FILENAMES variable set to the AMDVLK JSON file.

The AMDVLK graphics were terrible (significant percentage of random pixels turning random colors, bad rendering of elements, etc), but I did not experience any hangs during the cutscene. After reaching a known save point, I switched back to mesa/RADV-llvm and haven't experienced a hang since (haven't progressed that much further yet, but that's the only hang so far - about 13% of the game has been completed).

This would seem to point to a bug at least partially due to mesa/RADV-llvm.
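
The AMDVLK workaround described above can be sketched roughly as follows. This is an illustration under assumptions, not the exact steps used: all paths and the api_version value are hypothetical placeholders, and the two helper names are invented here. The "ICD" / "library_path" keys follow the Vulkan loader's ICD manifest format, and VK_ICD_FILENAMES is the variable named above.

```python
import json
import os

def localize_icd_manifest(manifest_text, library_path):
    """Return the ICD manifest JSON with ICD.library_path replaced."""
    manifest = json.loads(manifest_text)
    manifest["ICD"]["library_path"] = library_path
    return json.dumps(manifest, indent=4)

def env_with_icd(manifest_path, base_env=None):
    """Copy an environment and point the Vulkan loader at a single manifest."""
    env = dict(os.environ if base_env is None else base_env)
    env["VK_ICD_FILENAMES"] = manifest_path
    return env


# Hypothetical manifest contents and paths, for illustration only.
original = json.dumps({
    "file_format_version": "1.0.0",
    "ICD": {"library_path": "/usr/lib/x86_64-linux-gnu/amdvlk64.so",
            "api_version": "1.1.0"},
})
patched = localize_icd_manifest(original, "/home/user/amdvlk/amdvlk64.so")
env = env_with_icd("/home/user/amdvlk/amd_icd64.json", base_env={})
print(patched)
print(env)
```

The resulting env dictionary could then be passed to whatever launches the game (for example the env argument of subprocess.run), so the override applies to that one process instead of the whole session.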
Comment 224 Marko Popovic 2019-11-10 12:26:53 UTC
(In reply to lptech1024 from comment #223)
> Followup to #216:
> 
> Fedora 31: Kernel 5.3.9, GNOME 3.34, Mesa 19.2.2, linux-firmware 20190923,
> LLVM 9.0.0
> 
> The hang is 100% reproducible.
> 
> It occurs running the Linux-native (Vulkan) version of Shadow of the Tomb
> Raider (SotTR). I have never run SotTR under Proton/Wine, so that isn't a
> confounding variable.
> 
> The (unskippable) cutscene is for the Amazon River in Peru and occurs
> anywhere between 15 seconds before the pilot is struck and the pilot is
> struck. Even when the video hangs, you can usually hear fragments (sound
> effects) of the game for a few seconds afterwards.
> 
> I ran SotTR with vktrace and activated the Gnome (Wayland) overview to see
> if I could catch any relevant terminal output (none that I saw). The
> game still had focus, so it continued playing. After the hang (when I
> rebooted), there wasn't a vktrace file. I would assume this would be either
> it didn't write it out due to the hang or it didn't have content to write.
> 
> However, with it running visible in the overview (and a manual kernel
> update), I got both ring gfx and sdma errors:
> 
> Nov 07 [SNIP]:24 [SNIP] kernel: [drm] GPU recovery disabled.
> Nov 07 [SNIP]:24 [SNIP] kernel: [drm] GPU recovery disabled.
> Nov 07 [SNIP]:24 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process  pid 0 thread  pid 0
> Nov 07 [SNIP]:24 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process gnome-shell pid 1722 thread gnome-shel:cs0 pid
> 1768
> Nov 07 [SNIP]:24 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> ring sdma1 timeout, signaled seq=1049, emitted seq=1053
> Nov 07 [SNIP]:24 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> ring sdma0 timeout, signaled seq=30017, emitted seq=30020
> Nov 07 [SNIP]:19 [SNIP] kernel: [drm] GPU recovery disabled.
> Nov 07 [SNIP]:19 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process ShadowOfTheTomb pid 3890 thread WebViewRenderer
> pid 4981
> Nov 07 [SNIP]:19 [SNIP] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> ring gfx_0.0.0 timeout, signaled seq=75610, emitted seq=75612
> Nov 07 [SNIP]:19 [SNIP] kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]]
> *ERROR* Waiting for fences timed out or interrupted!
> 
> As a workaround to proceed in the game, I downloaded the AMDVLK 2019.Q4.2
> .deb, extracted the contents, modified the JSON file (to point to the local
> amdvlk64.so), and ran SotTR with the VK_ICD_FILENAMES variable set to the
> AMDVLK JSON file.
> 
> The AMDVLK graphics were terrible (significant percentage of random pixels
> turning random colors, bad rendering of elements, etc), but I did not
> experience any hangs during the cutscene. After reaching a known save point,
> I switched back to mesa/RADV-llvm and haven't experienced a hang since
> (haven't progressed that much further yet, but that's the only hang so far -
> about 13% of the game has been completed).
> 
> This would seem to point to a bug at least partially due to mesa/RADV-llvm.

radv-related hangs got fixed in the Mesa 20 git series; this thread is more concerned with SDMA kernel-driver hangs.
Comment 225 John Smith 2019-11-10 12:42:54 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #141)

> For radeonsi the AMD_DEBUG=nodma environment variable is a workaround until
> we figure out a proper fix.

Is this seriously what AMD calls "support"? No offense but this is ridiculous, this card has been out for four months and it still can't even browse firefox reliably, even after these "workarounds" and "patches". 

Then we waited two months for the drivers to even get properly released, and all this wait was for nothing because the drivers are useless, you can't even browse firefox or let alone play any actual games. What is the point of having open source drivers if they don't even work? Nvidia's GPUs have had day one support, and unlike AMD, "support" actually means the GPU works for something that is meaningful.
Comment 226 William Casarin 2019-11-10 14:15:22 UTC
(In reply to Marko Popovic from comment #222)
> (In reply to William Casarin from comment #221)
> > mesa 19.3.0-rc2 + RADV_PERFTEST=aco fixed this for me
> 
> ACO should have no impact on SDMA. Firstly OpenGL still uses LLVM, and
> OpenGL is the only one using SDMA in the first place, radv doesn't. So you
> must be talking about some different kinds of hangs, probably the ring_gfx
> types.

you're right, I wasn't aware that this thread was only for sdma related hangs. The
Comment 227 John H 2019-11-11 02:50:13 UTC
Hi all.

For the last couple of weeks I have been following this thread and just wanted to report my experiences and findings. First off, my machine's specs:

AMD Ryzen 3700X
Aorus X570 Pro Wifi motherboard
32 GB (16x2) DDR4 3200 RAM
PowerColor Red Devil 5700XT Graphics
Various SSD / HDD all on SATA.
Windows 10 / Debian Sid

Debian Sid: Kernel 5.3.10, Mesa 19.2.3, LLVM 9 as of writing this.

In the whole time I have had this graphics card (October 21 onwards) I don't think I have had any crashes / freezes on the desktop or while browsing with Chromium. However, I have had hard freezes when playing games. A specific one I can reproduce EVERY. SINGLE. TIME. was when playing Unreal Tournament 3 via Steam Proton. On the "Shangri La" map I encountered lockups anywhere from a few seconds to a few minutes into the game, forcing me to hit the reset button. I was able to SSH in via my phone before resetting, and dmesg said something about amdgpu GPU recovery failed.

My 5700XT has dual BIOSes: one overclocked, the other "silent". By default the switch was in the OC position; earlier today I flipped it to silent, and since then, NO freezes in UT whatsoever! I figure the factory overclock PowerColor implemented on this card is just a touch too high and is therefore unstable. Forza 6 Apex in Windows 10 also hard froze my PC, forcing me to reset. That problem has also been eliminated since flipping the switch. A slight performance loss, but I'll take the stability any day.


TL;DR - If your Navi card has dual BIOS, try switching to the lower clocked BIOS if you haven't already. It may just help. Certainly, I'll report back if I find any other issues in Debian that are linked to this gfx card.
Comment 228 Shmerl 2019-11-11 03:01:20 UTC
(In reply to John H from comment #227)
>
> specific one I can reproduce EVERY. SINGLE. TIME. was when playing Unreal
> Tournament 3 via Steam proton. The "Shangri La" map i encountered lockups
> anywhere from a few seconds to a few minutes into the game. Forcing me to
> hit the reset button. 

This could be a llvm / Mesa bug, not the kernel one. If you can reproduce it, please report it for that game individually to the Mesa bug tracker, with an apitrace.
Comment 229 Marko Popovic 2019-11-11 08:05:07 UTC
(In reply to Shmerl from comment #228)
> (In reply to John H from comment #227)
> >
> > specific one I can reproduce EVERY. SINGLE. TIME. was when playing Unreal
> > Tournament 3 via Steam proton. The "Shangri La" map i encountered lockups
> > anywhere from a few seconds to a few minutes into the game. Forcing me to
> > hit the reset button. 
> 
> This could be a llvm / Mesa bug, not the kernel one. If you can reproduce
> it, please report it for that game individually to the Mesa bug tracker,
> with an apitrace.

And make sure to NOT report it for the MESA version as old as 19.2.3... only report the bug if you're running current 19.3 RC series or 20 git series... because a lot of those might have already been fixed.

best regards
Comment 230 Daniel Suarez 2019-11-14 00:44:12 UTC
(In reply to John Smith from comment #225)
> (In reply to Pierre-Eric Pelloux-Prayer from comment #141)
> 
> > For radeonsi the AMD_DEBUG=nodma environment variable is a workaround until
> > we figure out a proper fix.
> 
> Is this seriously what AMD calls "support"? No offense but this is
> ridiculous, this card has been out for four months and it still can't even
> browse firefox reliably, even after these "workarounds" and "patches". 
> 
> Then we waited two months for the drivers to even get properly released, and
> all this wait was for nothing because the drivers are useless, you can't
> even browse firefox or let alone play any actual games. What is the point of
> having open source drivers if they don't even work? Nvidia's GPUs have had
> day one support, and unlike AMD, "support" actually means the GPU works for
> something that is meaningful.

I wouldn't really call what is happening here "support". Really feels like us Linux users were thrown to the side with little consideration.
Comment 231 Sander Lienaerts 2019-11-15 20:10:58 UTC
Been following this thread for a while now. Can't believe this has been known for 3 months, without a fix released.

Just a moment ago a random freeze occurred running Firefox and other applications, no games. Spotify kept playing in the background. Cursor not moving and unable to open another shell.

This happened with AMD_DEBUG="nongg,nodma" enabled. Running kernel 5.4rc7 and Mesa 19.2.4.

Here is an output of the log before reboot:

nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32769, for process Xorg pid 811 thread Xorg:cs0 pid 974)
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:   in page starting at address 0x00000318c00e7000 from client 27
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00541C51
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          MORE_FAULTS: 0x1
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          WALKER_ERROR: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          PERMISSION_FAULTS: 0x5
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          MAPPING_ERROR: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          RW: 0x1
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32769, for process Xorg pid 811 thread Xorg:cs0 pid 974)
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:   in page starting at address 0x00000318c00e6000 from client 27
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          MORE_FAULTS: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          WALKER_ERROR: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          PERMISSION_FAULTS: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          MAPPING_ERROR: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          RW: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32769, for process Xorg pid 811 thread Xorg:cs0 pid 974)
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:   in page starting at address 0x00000318c00e9000 from client 27
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          MORE_FAULTS: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          WALKER_ERROR: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          PERMISSION_FAULTS: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          MAPPING_ERROR: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          RW: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32769, for process Xorg pid 811 thread Xorg:cs0 pid 974)
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:   in page starting at address 0x00000318c00e8000 from client 27
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          MORE_FAULTS: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          WALKER_ERROR: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          PERMISSION_FAULTS: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          MAPPING_ERROR: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          RW: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0: [gfxhub] page fault (src_id:0 ring:40 vmid:5 pasid:32769, for process Xorg pid 811 thread Xorg:cs0 pid 974)
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:   in page starting at address 0x00000318c00ea000 from client 27
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00000000
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          MORE_FAULTS: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          WALKER_ERROR: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          PERMISSION_FAULTS: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          MAPPING_ERROR: 0x0
nov 15 20:47:58 sander-pc kernel: amdgpu 0000:0a:00.0:          RW: 0x0
nov 15 20:48:09 sander-pc kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
nov 15 20:48:09 sander-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=6760, emitted seq=6763
nov 15 20:48:09 sander-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 811 thread Xorg:cs0 pid 974
nov 15 20:48:09 sander-pc kernel: [drm] GPU recovery disabled.
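
For readers comparing fault dumps, the sub-fields printed under GCVM_L2_PROTECTION_FAULT_STATUS in the log above can be recovered from the raw register value with plain bit masking. The sketch below is only illustrative: the bit offsets and widths in FIELDS are an assumption (they are not stated anywhere in this thread), chosen because they reproduce the values the kernel printed for 0x00541C51. The helper name decode_fault_status is hypothetical.

```python
# Assumed bit layout (shift, width) for the fields shown in the log.
# This layout is NOT taken from the thread; it is an assumption that
# happens to reproduce the values printed for 0x00541C51.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "RW": (18, 1),
}


def decode_fault_status(status):
    """Split a raw status value into the named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


decoded = decode_fault_status(0x00541C51)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

A status of 0x00000000, as in the later faults in the same log, decodes to all-zero fields under this layout.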
Comment 232 viste.sylvain 2019-11-15 21:29:19 UTC
(In reply to Sander Lienaerts from comment #231)
> Been following this thread for a while now. Can't believe this has been
> known for 3 months, without a fix released.
> 
> Just a moment ago a random freeze occurred running Firefox and other
> applications, no games. Spotify kept playing in the background. Cursor not
> moving and unable to open another shell.
> 
> This happened with AMD_DEBUG="nongg,nodma" enabled. Running kernel 5.4rc7
> and Mesa 19.2.4.

I'm currently using kernel 5.4 and mesa-git (from the lcarlier repo; it is labelled Mesa 20, although there is no Mesa 20 in the git repository yet) on Arch, and I'm not seeing any hangs or freezes. It seems to be fixed, but maybe I'm just lucky.
Comment 233 Alex Deucher 2019-11-16 16:22:22 UTC
Does attachment 145971 help?
Comment 234 Timur Kristóf 2019-11-16 17:53:00 UTC
(In reply to John H from comment #227)
> However, I have hard freezes when playing games. A
> specific one I can reproduce EVERY. SINGLE. TIME. was when playing Unreal
> Tournament 3 via Steam proton.

Sounds like the same, or similar issue as this one:
https://gitlab.freedesktop.org/mesa/mesa/issues/868

In that case it was caused by an LLVM bug that has been fixed in LLVM 10 for a while but hasn't made it into LLVM 9 yet.
If you use Mesa 19.3, can you check whether the same issue occurs with ACO?

(In reply to John Smith from comment #225)
> Is this seriously what AMD calls "support"? No offense but this is
> ridiculous, this card has been out for four months and it still can't even
> browse firefox reliably, even after these "workarounds" and "patches". 

I can sympathize with your frustration, but I don't think this attitude is helpful. Pierre-Eric and Alex are doing their best to solve this problem. Insulting each other in the bugzilla is not constructive and won't bring us closer to the solution.
Comment 235 Marko Popovic 2019-11-16 17:58:05 UTC
(In reply to Alex Deucher from comment #233)
> Does attachment 145971 help?

No, that patch is for flip hangs, which only happen in some games. Random SDMA hangs are still present, but SDMA is disabled in Mesa 20, so for the time being it should be more stable.

(In reply to Timur Kristóf from comment #234)
> (In reply to John H from comment #227)
> > However, I have hard freezes when playing games. A
> > specific one I can reproduce EVERY. SINGLE. TIME. was when playing Unreal
> > Tournament 3 via Steam proton.
> 
> Sounds like the same, or similar issue as this one:
> https://gitlab.freedesktop.org/mesa/mesa/issues/868
> 
> In that case it was caused by an LLVM bug that has been fixed in LLVM 10 for
> a while but hasn't made it into LLVM 9 yet.
> If you use Mesa 19.3, can you check whether the same issue occurs with ACO?
> 

RADV hangs are not related to the SDMA hangs, but luckily those at least are fixed in LLVM 10, so we can have a decently stable experience with AMD_DEBUG=nodma, which is basically enabled by default in Mesa 20.
Comment 236 Alex Deucher 2019-11-18 15:11:30 UTC
(In reply to Marko Popovic from comment #235)
> (In reply to Alex Deucher from comment #233)
> > Does attachment 145971 help?
> 
> No, that patch is for flip hangs, which only happen in some games. Random
> SDMA hangs are still present, but SDMA is disabled in Mesa 20, so for the
> time being it should be more stable.

They may be related. If the SDMA is waiting on a fence from the display engine it would time out if that display fence never triggers.
Comment 237 Tobias Frisch 2019-11-19 01:21:25 UTC
Hardware:
- Asus ROG Crosshair VI Extreme
- AMD Ryzen 7 2700X
- Sapphire Radeon RX 5700

Software:
- linux 5.3.11-arch1-1
- mesa 19.2.4-1

I just tried to reproduce the hangs, which occur fairly randomly on Arch. I started Steam and ran the benchmark in Shadow of the Tomb Raider. It completed on the highest settings with a high FPS score, but the display lagged quite badly (it even stuttered once for 1 to 3 seconds).

I hope (and guess) that the misleading FPS counter in SoTR is not related to the amdgpu driver, is it?

Anyhow, starting Rise of the Tomb Raider afterwards froze my system again.

[14494.683266] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[14494.683354] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[14499.803441] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2989148, emitted seq=2989150
[14499.803522] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RiseOfTheTombRa pid 414233 thread RiseOfTheT:cs0 pid 414239
[14499.803525] [drm] GPU recovery disabled.
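
When comparing reports like this one, it can help to pull the ring name and the two sequence numbers out of the amdgpu_job_timedout line. The following is a minimal sketch, assuming the message format shown in the logs in this thread; the function name parse_ring_timeout and the "pending" field are hypothetical, and reading the difference as the number of fences not yet signaled is an interpretation, not something stated in the thread.

```python
import re

# Matches the "ring <name> timeout, signaled seq=N, emitted seq=M" part
# of the kernel message as it appears in the logs quoted in this thread.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return ring name and sequence numbers, or None if no match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return {
        "ring": match.group("ring"),
        "signaled": signaled,
        "emitted": emitted,
        # Assumed meaning: fences emitted but not yet signaled.
        "pending": emitted - signaled,
    }


line = (
    "[14499.803441] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx_0.0.0 timeout, signaled seq=2989148, emitted seq=2989150"
)
result = parse_ring_timeout(line)
print(result)
```

The same pattern also matches the sdma1 timeout line from comment 231, which makes it easy to tell at a glance whether a given freeze was on a gfx or an SDMA ring.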

I still have one question: how is the communication with AMD on these issues? Somehow (I would like to know how) their drivers work on my Ubuntu 18.04 LTS without any freezes so far (except when starting Blender). I am using it at the moment to get things done without worrying about random freezes (I had one today on Arch with linux 5.4.0-rc7-mainline). I hope these issues are fixed soon.
Comment 238 Shmerl 2019-11-19 01:29:55 UTC
(In reply to Tobias Frisch from comment #237)
> - linux 5.3.11-arch1-1
> - mesa 19.2.4-1
> 

That's really not a good idea. You'd need 5.4 with that flip patch applied and Mesa 20 (i.e. master) with llvm 20 if you want to avoid as many hangs as possible.
Comment 239 Shmerl 2019-11-19 01:30:33 UTC
*llvm10 I mean
Comment 240 Martin Peres 2019-11-19 09:50:12 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/892.

