Bug 111481 - AMD Navi GPU frequent freezes on both Manjaro/Ubuntu with kernel 5.3 and mesa 19.2 -git/llvm9
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: not set critical
Assignee: Default DRI bug account
Reported: 2019-08-25 00:50 UTC by Marko Popovic
Modified: 2019-09-17 21:24 UTC

Attachments
Merge last adg5f code (22.82 MB, patch)
2019-08-31 22:15 UTC, Mathieu Belanger
APITrace log from Citra crash (56.68 MB, application/octet-stream)
2019-09-02 08:25 UTC, Marko Popovic
APITrace log from RocketLeague crash (581.72 KB, application/octet-stream)
2019-09-02 09:13 UTC, Marko Popovic
wip patch (1.01 KB, patch)
2019-09-10 15:23 UTC, Pierre-Eric Pelloux-Prayer
UMR dump of registers on a GPU lockup (187.27 KB, text/plain)
2019-09-10 18:25 UTC, Alexandr Kára
umr output of sdma0/sdma1 after RotTR freeze (20.22 KB, application/gzip)
2019-09-10 21:02 UTC, Sebastian Meyer

Description Marko Popovic 2019-08-25 00:50:43 UTC
I've tried my AMD Radeon RX 5700 XT on both Ubuntu (LLVM 9 / Mesa 19.3 - Oibaf PPA) and Manjaro (LLVM 10 git / mesa-git).
On both I was using GNOME Shell, and in both cases I had frequent lockups and freezes. Once, my GPU lost its connection to the monitor and stayed that way until I rebooted; other times the desktop would just freeze and take the whole system down.

Software tried: LLVM 10 git / MESA 19.3 - git on Manjaro
                LLVM 9 / MESA 19.3 git from Oibaf PPA
Kernels tried: Manjaro 5.3 RC4, Ubuntu 5.3 RC5 generic, Ubuntu drm-tip 5.3 daily

Error log:
avg 24 22:53:58 Marko-PC kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] ERROR Waiting for fences timed out or interrupted!
avg 24 22:53:58 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR ring gfx_0.0.0 timeout, signaled seq=94235, emitted seq=94237
avg 24 22:53:58 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] ERROR Process information: process citra-qt pid 27356 thread citra-qt:cs0 pid 27366

This happened on all setups, and the bug was pretty much the same each time. Lockups weren't extremely frequent, but frequent enough to be very noticeable (5-6 freezes per day on average).

Faulty hardware can probably be ruled out, since I never had a hiccup or anything even close to a crash or freeze on my Windows desktop.
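
For anyone comparing logs like the ones above, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of an amdgpu_job_timedout line. It is not part of any driver tooling: the function name and the regular expression are assumptions based only on the log format shown in this report, and reading "emitted - signaled" as the number of submitted fences that had not yet signaled is an interpretation, not something the log states.

```python
import re

# Assumed pattern, derived only from the timeout lines quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match.

    `pending` is emitted - signaled, i.e. how far the emitted sequence
    number is ahead of the last signaled one.
    """
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    return match.group("ring"), signaled, emitted, emitted - signaled
```

Feeding it the gfx_0.0.0 line from the description yields the ring name together with 94235, 94237 and a difference of 2; lines that are not timeout messages simply return None, so it can be run over a whole journal dump.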
Comment 1 Marko Popovic 2019-08-25 17:10:52 UTC
Adding error log from Manjaro:
avg 23 16:05:37 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=1742, emitted seq=1743
avg 23 16:05:37 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 975 thread gnome-shell:cs0 pid 988
avg 23 16:05:37 Marko-PC kernel: [drm] GPU recovery disabled.

Pretty much the same type of error happens in different situations, and very often at random while using the desktop. Of these two logs, the first is from launching OpenGL content in the Citra emulator, which is reproducible every time; the second, from Manjaro, is from just using GNOME Shell, where it would crash without any clear trigger.
Comment 2 Mathieu Belanger 2019-08-28 15:39:43 UTC
I confirm that I have this bug or a very similar one.

For some reason, it happens most when I'm using my IDE (IntelliJ-based).
It happens most often while I type code, and the crash occurs when the IDE is supposed to propose a code completion.

I get one to two crashes a day.

Video card is RX5700
CPU is Ryzen R7-2700X

Software tested: LLVM 9 git
libdrm, mesa, ddx updated from git very frequently.

The bug has been there since I got the card, about 3 weeks ago.
Comment 3 Matthias Müller 2019-08-30 22:07:20 UTC
I don't know if I'm encountering the same bug, but it is at least similar.
I don't get hard freezes/lockups, but I get a strange "stuttering", as if the whole OS halted for a few seconds, then continued for a few seconds... and the halted periods grew while the "usable seconds" quickly got shorter, to the point of unusability...

It doesn't happen at regular intervals (anything between 30 min and 120 min, it seems) and I haven't yet found a direct cause, but in journalctl the same messages seem to appear every time it begins:

kernel: amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xfd6000
kernel: amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xfd6000
 kernel: amdgpu 0000:0f:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0)
 kernel: amdgpu 0000:0f:00.0:   at page 0x0000600000fd6000 from 18
 kernel: amdgpu 0000:0f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00041152

after that there are a lot of these:

kernel: amdgpu: [powerplay] Failed to send message 0x40, response 0xffffffc2 param 0x2
kernel: amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80

until shutdown/hardreset.

Some observations that might help to narrow it down:
The first time it occurred, I had to do a few reboots that showed this behaviour right after startup until it finally worked again, for about 45 min.
As it then didn't work again after around 10 reboots, I tried uninstalling corectrl (which I used for a custom fan curve), and it finally booted normally again!
I then installed radeon-profile to have fan control (I don't want the fans to stand still on the desktop, as the card gets over 80 °C hot before the fans kick in...).
The issue still occurs with radeon-profile, but at least every reboot runs fine...
Another thing I noticed is that after the first "freeze" with radeon-profile, lm_sensors stopped reporting the fan speed for the card; it always stays at zero.

So maybe it is related to fan control or the sysfs interface in general?
Comment 4 Matthias Müller 2019-08-30 22:13:50 UTC
Forgot to mention: running Manjaro 5.3rc6.d0826.ga55aa89-1, mesa-git 1:19.3.0_devel.114849.0142dcb990e-1 and llvm-libs-git 10.0.0_r325376.70e158e09e9-1
And if it matters: firmware from https://aur.archlinux.org/packages/linux-firmware-agd5f-radeon-navi10/ v2019.08.26.14.36-1
Comment 5 Mathieu Belanger 2019-08-30 23:20:18 UTC
It probably really depends on what we do on our desktops. I just remembered that I stopped using FileZilla once I got this GPU, as it was crashing almost every time I used it (I never got through a session without a crash while it was open and running). I still use it for work, but I keep it to a minimum (open, upload, close) instead of keeping it running.
Comment 6 Alexandr Kára 2019-08-31 07:14:11 UTC
Might be related to https://bugs.freedesktop.org/show_bug.cgi?id=111269. I also get the "ring gfx_0.0.0 timeout" error (but not the "ring sdma0 timeout" error). 

Using LLVM from git + Mesa 19.2.0-rc1 on Fedora 30 with kernel from Fedora 31 (5.3.0-0.rc5.git0.1.fc31.x86_64). GPU AMD Radeon RX 5700 XT, CPU AMD Ryzen 7 1700, 32 GB RAM (EDD).
Comment 7 Mathieu Belanger 2019-08-31 22:15:36 UTC
Created attachment 145225
Merge last adg5f code

OK, I looked at the recent kernel patches and commits, and they seem to have fixed a couple of bugs. I do not know if they include this one, but I have not crashed once since I merged them into kernel 5.3-rc6 (that code is staged for the 5.4 merge window).

I attached the patch so you can merge it if you wish to try. It adds all the latest bits for AMDGPU to 5.3-rc6, including Renoir support.
Comment 8 Marko Popovic 2019-08-31 22:18:51 UTC
(In reply to Mathieu Belanger from comment #7)
> Created attachment 145225 [details] [review] [review]
> Merge last adg5f code
> 
> Ok, I did look at the recent kernel patch and commit and they seam to have
> fixed a couple bugs. I do not know it it include these but I did not crash
> one time since I merged that into the kernel 5.3-rc6. (that code is staged
> for 5.4 merge window).
> 
> I did attach the patch so you can merge that if you wish to try. It add all
> the latest bits for AMDGPU into 5.3-rc6, including Renoir support.

How do I merge the patch myself? :) I'd like to try it
Comment 9 Matthias Müller 2019-08-31 23:50:35 UTC
On my side, I can report that the issue does not occur if I don't use a tool to modify the fans. Do any of you use something like that, or are these separate issues?
Comment 10 Marko Popovic 2019-09-01 00:36:02 UTC
(In reply to Matthias Müller from comment #9)
> On my side i can report that the issue does not occur if i don't use a tool
> to modify the FANs - does anyone of you use something of the like or are
> this seperate issues?

I don't use any tools, all is stock.

(In reply to Mathieu Belanger from comment #7)
> Created attachment 145225 [details] [review] [review]
> Merge last adg5f code
> 
> Ok, I did look at the recent kernel patch and commit and they seam to have
> fixed a couple bugs. I do not know it it include these but I did not crash
> one time since I merged that into the kernel 5.3-rc6. (that code is staged
> for 5.4 merge window).
> 
> I did attach the patch so you can merge that if you wish to try. It add all
> the latest bits for AMDGPU into 5.3-rc6, including Renoir support.

After applying the patch, the same type of error occurs. Luckily it is very easy to reproduce with the Citra emulator; apparently it does something that AMD's driver really doesn't like, which makes the error more likely to occur. The error also seems more likely to occur on my end when the CPU is under heavy I/O load.

Last log after applying the latest patch from the merge posted in the attachment:
sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=16312, emitted seq=16314
sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process citra-qt pid 2928 thread citra-qt:cs0 pid 2938
sep 01 02:29:10 Marko-PC kernel: [drm] GPU recovery disabled.

It would be very nice to get an official AMD response, at least to confirm that we're being listened to.
Comment 11 Marko Popovic 2019-09-01 10:24:04 UTC
The same bug is also reproducible when launching the native version of Rocket League.

Here are the logs:
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: [gfxhub] page fault (src_id:0 ring:158 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:   in page starting at address 0x0000000000fff000 from client 27
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00001B3C
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:          MORE_FAULTS: 0x0
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:          WALKER_ERROR: 0x6
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:          PERMISSION_FAULTS: 0x3
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:          MAPPING_ERROR: 0x1
sep 01 12:20:56 Marko-PC kernel: amdgpu 0000:03:00.0:          RW: 0x0
sep 01 12:21:12 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=7198, emitted seq=7200
sep 01 12:21:12 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RocketLeague pid 3035 thread RocketLeag:cs0 pid 3042
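
The decoded fields in the log above can be recomputed from the raw GCVM_L2_PROTECTION_FAULT_STATUS value. The following Python sketch does that under a loudly stated assumption: the bit positions in FIELDS are not taken from this report, they are an assumed layout that happens to reproduce the four values the kernel printed for 0x00001B3C. The RW field is left out because the log alone is not enough to pin down its position, and the layout should not be assumed to apply to the mmhub register quoted earlier in this thread.

```python
# Assumed (shift, width) layout; only checked against the decoded values
# printed next to 0x00001B3C in the log above.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
}


def decode_fault_status(status):
    """Extract each assumed bit field from the raw status register value."""
    decoded = {}
    for name, (shift, width) in FIELDS.items():
        mask = (1 << width) - 1
        decoded[name] = (status >> shift) & mask
    return decoded
```

With 0x00001B3C as input, the sketch returns MORE_FAULTS 0x0, WALKER_ERROR 0x6, PERMISSION_FAULTS 0x3 and MAPPING_ERROR 0x1, matching the lines the kernel logged.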
Comment 12 Mathieu Belanger 2019-09-01 16:36:37 UTC
I have not crashed and have > 24 h of uptime.

I could not test Citra as I don't have a 3DS and the ROMs I found are encrypted.

I could not test Rocket League, as it would require me to pay for a game I will not play.

I will continue to test later today.
Comment 13 Mathieu Belanger 2019-09-02 06:05:20 UTC
(In reply to Marko Popovic from comment #10)
> (In reply to Matthias Müller from comment #9)
> > On my side i can report that the issue does not occur if i don't use a tool
> > to modify the FANs - does anyone of you use something of the like or are
> > this seperate issues?
> 
> I don't use any tools, all is stock.
> 
> (In reply to Mathieu Belanger from comment #7)
> > Created attachment 145225 [details] [review] [review] [review]
> > Merge last adg5f code
> > 
> > Ok, I did look at the recent kernel patch and commit and they seam to have
> > fixed a couple bugs. I do not know it it include these but I did not crash
> > one time since I merged that into the kernel 5.3-rc6. (that code is staged
> > for 5.4 merge window).
> > 
> > I did attach the patch so you can merge that if you wish to try. It add all
> > the latest bits for AMDGPU into 5.3-rc6, including Renoir support.
> 
> After applying the patch, same type of error occurs, luckily very easy to
> reproduce with Citra emulator, apparently it does something that AMD's
> driver really doesn't like and makes chances higher for error to occur. Also
> when CPU is under heavy I/O load error seems more likely to occur as well on
> my end.
> 
> Last log after applying the latest patch from the merge posted in the
> attachment:
> sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]]
> *ERROR* Waiting for fences timed out!
> sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> ring gfx_0.0.0 timeout, signaled seq=16312, emitted seq=16314
> sep 01 02:29:10 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR*
> Process information: process citra-qt pid 2928 thread citra-qt:cs0 pid 2938
> sep 01 02:29:10 Marko-PC kernel: [drm] GPU recovery disabled.
> 
> If we could get any official AMD responses to at least make sure that we're
> at least being listened to would be very nice.

I was able to reproduce that Citra crash.
I followed the instructions; it crashed instantly after choosing continue (or a fraction of a second after: the music lagged a little, then a complete system crash; I was able to sync/umount/reboot with the magic SysRq keys).

Is your crash at exactly the same place? If so, then it's very reproducible, and it might be a good idea to run an OpenGL trace to see which commands were sent last to provoke the crash.

I am not familiar with the Ubuntu stuff: were these compiled on your system? If not, do you know the build dates of your Mesa, libdrm and xf86-video-amdgpu (X11 DDX)?

Also, can you tell me what dates your microcode files have?

Libdrm : 07:49:10 PM 08/27/2019
Mesa : 05:37:07 PM 08/30/2019
Xorg amdgpu DDX : 07:55:17 PM 08/27/2019

The microcode files were not available in my distribution when I installed them. I downloaded and installed them on August 6, but they were from around July 15, I think. I remember that the latest microcode at that time was crashing with a black screen on module load, and that's why I installed an older version.
Comment 14 Marko Popovic 2019-09-02 07:24:16 UTC
(In reply to Mathieu Belanger from comment #13) 
> I was able to reproduce that Citra crash.
> Followed the instruction, it did crash instantly after choosing continue (or
> a fraction of a second after, the music lagged a lil and complete system
> crash (was able so sync/umount/reboot with the magics key)).
> 
> Is your crash exactly at the same place? If so then it's very reproducible
> and  it might be a good idea to run a opengl trace to see what commands was
> sent last to provoke the crash.
> 
> I am not familiar with the Ubuntu stuff, is these got compiled on your
> system? if no do you know the build date of your Mesa, libdrm and
> xf86-video-amdgpu (x11 ddx).
> 
> Also can you tell what microcode files dates you do have?
> 
> Libdrm : 07:49:10 PM 08/27/2019
> Mesa : 05:37:07 PM 08/30/2019
> Xorg amdgpu DDX : 07:55:17 PM 08/27/2019
> 
> The microcode files where not available on my distribution when I installed
> them. I did download/install them on August 6 but they where from July 15
> ish I think, I remember that the latest microcode at that time where
> crashing with a black screen on module load and that's why I did install an
> older version.

Yes, it always happens at the same place with the Citra emulator. What bothers me more about the bug, however, is that it sometimes happens completely randomly on my system, without any obvious trigger, while just browsing and using my desktop. So it's not Citra-exclusive, but luckily I've found the Citra method to provoke the bug, so we can do more detailed logging.

Further observations:
- The bug is the same type as the other crashes and is not exclusive to the Citra emulator; it also happens when launching Rocket League, and sometimes randomly while using the desktop
- The same type of crash IS NOT reproducible on Windows on the same GPU
- The same type of bug IS NOT reproducible on my Intel HD laptop with the same versions of Mesa/LLVM, which probably means either a faulty AMD kernel driver or faulty firmware binaries.

My versions are:
MESA: Mesa 19.3.0-devel (git-6775a52 2019-09-02 eoan-oibaf-ppa)
Kernel: Ubuntu mainline 5.3 daily build (I ALSO tried amd-drm-next-5.4, same bug is reproducable)
Firmware binaries: 2019-08-26 from /~agd5f/radeon_ucode/navi10
Comment 15 Pierre-Eric Pelloux-Prayer 2019-09-02 08:01:39 UTC
(In reply to Marko Popovic from comment #14)
> 
> Yes, always happens at the same place with Citra emulator

Could you capture a trace of the problem (using Apitrace or Renderdoc)?

This would be very helpful to fix it.
Comment 16 Marko Popovic 2019-09-02 08:25:17 UTC
Created attachment 145232
APITrace log from Citra crash
Comment 17 Marko Popovic 2019-09-02 08:26:32 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #15)
> (In reply to Marko Popovic from comment #14)
> > 
> > Yes, always happens at the same place with Citra emulator
> 
> Could you capture a trace of the problem (using Apitrace or Renderdoc)?
> 
> This would be very helpful to fix it.

I added the reproduced Citra crash, recorded using the command:
apitrace trace ./citra-qt

I hope this is correct; if you need anything else, or need it done differently, please just let me know!
Comment 18 Marko Popovic 2019-09-02 09:13:48 UTC
Created attachment 145233
APITrace log from RocketLeague crash

I am adding Rocket League crash output from apitrace.
Comment 19 Pierre-Eric Pelloux-Prayer 2019-09-02 11:53:41 UTC
(In reply to Marko Popovic from comment #17)
> (In reply to Pierre-Eric Pelloux-Prayer from comment #15)
> > (In reply to Marko Popovic from comment #14)
> > > 
> > > Yes, always happens at the same place with Citra emulator
> > 
> > Could you capture a trace of the problem (using Apitrace or Renderdoc)?
> > 
> > This would be very helpful to fix it.
> 
> I added reproduced Citra crash recorded by using command:
> apitrace trace ./citra-qt
> 
> I hope this is correct, if you need anything else or done differently please
> just let me know!

Thanks for the trace!

Replaying the trace a few times is enough to reliably reproduce the hang.

Using AMD_DEBUG=nongg seems to prevent it, so it could be a temporary workaround until a proper fix is found.
Could you confirm this on your system?


> 
> I am adding Rocket League crash output from apitrace.

This trace file is very small (only one frame) and doesn't hang here.
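
The replay procedure described in this comment could be scripted roughly as follows. This is a hedged Python sketch, not something posted by the driver developers: the helper names and the trace file name used in the note below are hypothetical, and it only relies on the `apitrace replay` subcommand and on the AMD_DEBUG environment variable mentioned above. Since a successful reproduction hangs the GPU, it should only be run on a machine that can be reset.

```python
import os
import subprocess


def replay_command(trace_path):
    """Build the apitrace invocation that replays a recorded trace."""
    return ["apitrace", "replay", trace_path]


def replay_env(amd_debug=None, base_env=None):
    """Copy the environment, optionally setting AMD_DEBUG (e.g. "nongg")."""
    env = dict(os.environ if base_env is None else base_env)
    if amd_debug:
        env["AMD_DEBUG"] = amd_debug
    return env


def replay_trace(trace_path, runs=3, amd_debug=None):
    """Replay the trace `runs` times and return the exit code of each run.

    If the GPU hangs, the call may never return; that is the condition
    being tested for.
    """
    exit_codes = []
    for _ in range(runs):
        result = subprocess.run(
            replay_command(trace_path), env=replay_env(amd_debug)
        )
        exit_codes.append(result.returncode)
    return exit_codes
```

For example, replay_trace("citra-qt.trace", runs=5) would exercise the hang, while replay_trace("citra-qt.trace", runs=5, amd_debug="nongg") would check the workaround; the file name here is a placeholder for wherever the attached trace was saved.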
Comment 20 Marko Popovic 2019-09-02 12:24:49 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #19)
> Thanks for the trace!
> 
> Replaying the trace a few times is enough to reliably to reproduce the hang.
> 
> Using AMD_DEBUG=nongg seems to prevent it so it could be a temporary
> workaround until a proper fix is found.
> Could you confirm this on your system?
> 
> 
> > 
> > I am adding Rocket League crash output from apitrace.
> 
> This trace file is very small (only one frame) and doesn't hang here.

Thanks for the workaround! Here are my results:

- AMD_DEBUG=nongg fixes the Citra-related crash

- It doesn't fix the Rocket League hang; that seems to be a completely different beast... The GPU hang still happens, but I don't know why; apparently apitrace doesn't provide any useful information about the cause.

Now I will continue testing to see whether the workaround for the Citra-related crash also helps with my random desktop freezes and hangs, and will report back. I added AMD_DEBUG=nongg to my /etc/environment, so it should apply to the desktop as well.
Comment 21 Marko Popovic 2019-09-02 16:45:09 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #19)
> (In reply to Marko Popovic from comment #17)
> > (In reply to Pierre-Eric Pelloux-Prayer from comment #15)
> > > (In reply to Marko Popovic from comment #14)
> > > > 
> > > > Yes, always happens at the same place with Citra emulator
> > > 
> > > Could you capture a trace of the problem (using Apitrace or Renderdoc)?
> > > 
> > > This would be very helpful to fix it.
> > 
> > I added reproduced Citra crash recorded by using command:
> > apitrace trace ./citra-qt
> > 
> > I hope this is correct, if you need anything else or done differently please
> > just let me know!
> 
> Thanks for the trace!
> 
> Replaying the trace a few times is enough to reliably to reproduce the hang.
> 
> Using AMD_DEBUG=nongg seems to prevent it so it could be a temporary
> workaround until a proper fix is found.
> Could you confirm this on your system?
> 
> 
> > 
> > I am adding Rocket League crash output from apitrace.
> 
> This trace file is very small (only one frame) and doesn't hang here.

Okay, I just got another random hang on the desktop, even with the environment variable set the whole time. Unfortunately it seems to be very hard to trace and very random :( It seems that the Citra hang is unrelated to this bug after all; it's a completely different bug. It's good that we discovered another (Citra-related) bug on the way, but we probably can't say the workaround solves this one, because hangs still occur randomly on the desktop.
Comment 22 Pierre-Eric Pelloux-Prayer 2019-09-02 17:01:52 UTC
> Okay I just got another random hang on the desktop. even with the
> environment variable turned on the whole time. Unfortunately it seems to be
> very hardly tracable seems to be very random :( Seems that Citra hang is
> unrelated to this bug after all, it's a completely different bug. It's good
> that we discovered another (citra-related) bug on the way but probably we
> can't mark that workaround to solve anything because hangs still randomly
> occur on the desktop.

Yes, it's possible that there are different bugs.

For the citra bug: I suspect an issue with Geometry Shaders + NGG but this will require more debugging to confirm (also: using wavesize=64 didn't help, so it's not a regression caused by a0d330bedb9e).

I'm also testing using AMD_DEBUG=nodma system wide to see if it prevents the sdma0 kind of hangs.
Comment 23 Marko Popovic 2019-09-02 17:05:49 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #22)
> > Okay I just got another random hang on the desktop. even with the
> > environment variable turned on the whole time. Unfortunately it seems to be
> > very hardly tracable seems to be very random :( Seems that Citra hang is
> > unrelated to this bug after all, it's a completely different bug. It's good
> > that we discovered another (citra-related) bug on the way but probably we
> > can't mark that workaround to solve anything because hangs still randomly
> > occur on the desktop.
> 
> Yes, it's possible that there are different bugs.
> 
> For the citra bug: I suspect an issue with Geometry Shaders + NGG but this
> will require more debugging to confirm (also: using wavesize=64 didn't help,
> so it's not a regression caused by a0d330bedb9e).
> 
> I'm also testing using AMD_DEBUG=nodma system wide to see if it prevents the
> sdma0 kind of hangs.

Yes, both the Rocket League and desktop hangs seem to be the sdma0 type. I will add that parameter as well, see if it makes any difference for the Rocket League hang, and use the desktop with both flags enabled.

Well, finding multiple bugs while debugging one can only be a good thing: fewer bugs in the future. My personal computing seems to hit quite a few corner cases that otherwise go unnoticed :D, which should benefit many new happy Navi users.
Comment 24 Marko Popovic 2019-09-02 17:16:13 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #22)
> Yes, it's possible that there are different bugs.
> 
> For the citra bug: I suspect an issue with Geometry Shaders + NGG but this
> will require more debugging to confirm (also: using wavesize=64 didn't help,
> so it's not a regression caused by a0d330bedb9e).
> 
> I'm also testing using AMD_DEBUG=nodma system wide to see if it prevents the
> sdma0 kind of hangs.

OK, I confirm that AMD_DEBUG=nodma gets rid of the Rocket League startup crash. I will report on desktop stability for the rest of the day!
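
Since both workarounds are applied system-wide through /etc/environment in this thread, here is a small Python sketch for adding or removing one flag from an AMD_DEBUG line in a file of that style. The function is hypothetical, and it assumes that AMD_DEBUG accepts a comma-separated list of flags (such as nongg,nodma); it works on text only and leaves reading and writing the real file, which needs root, to the caller.

```python
def set_amd_debug_flag(env_text, flag, enabled=True):
    """Return env_text with `flag` added to or removed from AMD_DEBUG.

    Any existing AMD_DEBUG line is dropped and, if flags remain, a single
    rebuilt line is appended at the end. Assumes comma-separated flags.
    """
    flags = []
    kept_lines = []
    for line in env_text.splitlines():
        if line.strip().startswith("AMD_DEBUG="):
            value = line.split("=", 1)[1].strip().strip('"')
            flags = [item for item in value.split(",") if item]
        else:
            kept_lines.append(line)

    if enabled and flag not in flags:
        flags.append(flag)
    if not enabled:
        flags = [item for item in flags if item != flag]

    if flags:
        kept_lines.append("AMD_DEBUG=" + ",".join(flags))
    return "\n".join(kept_lines) + "\n"
```

Calling set_amd_debug_flag(text, "nodma") on a file that already contains AMD_DEBUG=nongg produces AMD_DEBUG=nongg,nodma, and passing enabled=False removes a flag again when testing a patch without the workaround. A new login session is normally needed before the changed file takes effect.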
Comment 25 Mathieu Belanger 2019-09-03 14:56:26 UTC
I confirm that a system-wide nongg does not fix the random surprise crashes I get in FileZilla and PhpStorm.

Switching to a system-wide nodma (that sounds scary on the performance side).
Comment 26 Marko Popovic 2019-09-04 12:20:23 UTC
(In reply to Mathieu Belanger from comment #25)
> I confirm that a system wide nongg do not fix random surprise crash I get on
> filezilla and phpstorm.
> 
> Switching to system wide nodma (that sound scary on the performance side)

Yes, but unfortunately that is exactly what "solved" the sdma0 freezes for me. Let's hope that a proper fix comes as soon as possible!
Comment 27 Mathieu Belanger 2019-09-04 12:24:44 UTC
It did fix it for me too.
Comment 28 Pierre-Eric Pelloux-Prayer 2019-09-04 15:36:06 UTC
Regarding sdma ring hangs: if you still have access to the affected machine using ssh, it would be helpful to add a comment with the following information:

  - the last dmesg lines (at least the "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=9871, emitted seq=9873" one)
  - the output of : umr -R sdma0 (or sdma1 depending on which one failed)

Thanks!
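
To make the request above easier to follow over ssh, here is a hedged Python sketch that picks the relevant kernel lines out of saved dmesg output and works out which `umr -R` command matches the ring that timed out. The helper names are hypothetical; the only commands assumed are the `umr -R sdma0` / `umr -R sdma1` forms quoted in this comment, and the sketch only builds the command lists rather than running them.

```python
import re

# Matches the sdma ring named in an amdgpu_job_timedout line.
SDMA_TIMEOUT_RE = re.compile(r"ring (sdma\d+) timeout")


def last_driver_lines(dmesg_text, count=20):
    """Return the last `count` lines that mention amdgpu or the drm core."""
    lines = [
        line for line in dmesg_text.splitlines()
        if "amdgpu" in line or "[drm" in line
    ]
    return lines[-count:]


def umr_commands(dmesg_text):
    """Return one ["umr", "-R", ring] command per sdma ring that timed out."""
    rings = []
    for ring in SDMA_TIMEOUT_RE.findall(dmesg_text):
        if ring not in rings:
            rings.append(ring)
    return [["umr", "-R", ring] for ring in rings]
```

The output of last_driver_lines() can be pasted into a comment as is, and each list from umr_commands() can be run by hand or passed to subprocess.run on the affected machine.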
Comment 29 Marko Popovic 2019-09-05 11:14:41 UTC
(In reply to Mathieu Belanger from comment #27)
> It did fix it for me too.

(In reply to Pierre-Eric Pelloux-Prayer from comment #28)
> Regarding sdma ring hangs: if you still have access to the affected machine
> using ssh, it would be helpful to add a comment with the following
> information:
> 
>   - the last dmesg lines (at least the "[drm:amdgpu_job_timedout [amdgpu]]
> *ERROR* ring sdma1 timeout, signaled seq=9871, emitted seq=9873" one)
>   - the output of : umr -R sdma0 (or sdma1 depending on which one failed)
> 
> Thanks!

Mathieu, could you assist Pierre-Eric with this?
I am currently on vacation and won't be able to debug or test further until the 15th of September.
Comment 30 Mathieu Belanger 2019-09-05 11:50:19 UTC
I will disable the workaround on Friday after work.

Then I will report when it crashes.
Comment 31 Mathieu Belanger 2019-09-06 01:58:37 UTC
Is this patch set https://lists.freedesktop.org/archives/amd-gfx/2019-September/039593.html related to this?

Graceful page fault handling for Vega/Navi
Comment 32 Sebastian Meyer 2019-09-10 14:19:25 UTC
Having the same issues with my new Powercolor RX 5700 XT on Arch Linux.
The system freezes after a couple of seconds when I try to run games like RotTR. Other games I've tested, like Dota 2 for example, are unreliable and make the system freeze after a few minutes or after an hour or so.

The dmesg output when SSHing into my system:
[65070.475185] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[65070.475259] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[65075.595093] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[65075.595180] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[65075.595260] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=6662176, emitted seq=6662178
[65075.595322] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RiseOfTheTombRa pid 56804 thread RiseOfTheT:cs0 pid 56811
[65075.595324] [drm] GPU recovery disabled.

I've also had a couple of sdma0/sdma1 related freezes after opening resource-heavy websites in Chromium. Unfortunately though, I'm unable to reproduce it now. If the system freezes again, I will provide logs and umr output, as requested. The website which caused most of the freezes was izurvive.com (interactive DayZ map) and it froze while toggling map markers on and off.
Sep 08 17:49:52 basti-pc kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 08 17:49:57 basti-pc kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
Sep 08 17:49:57 basti-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=2372, emitted seq=2375
Sep 08 17:49:57 basti-pc kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process chromium pid 1271 thread chromium:cs0 pid 1331

$ pacman -Q linux-mainline linux-firmware-agd5f-radeon-navi10 {,lib32-}{mesa-git,vulkan-radeon-git,llvm-git,libdrm-git}
linux-mainline 5.3rc8-1
linux-firmware-agd5f-radeon-navi10 2019.08.26.14.36-1
mesa-git 1:19.3.0_devel.115190.f83f9d7daa0-1
lib32-mesa-git 1:19.3.0_devel.115190.f83f9d7daa0-1
vulkan-radeon-git 1:19.3.0_devel.115190.f83f9d7daa0-1
lib32-vulkan-radeon-git 1:19.3.0_devel.115190.f83f9d7daa0-1
llvm-git 10.0.0_r326348.d7d8bb937ad-1
lib32-llvm-git 10.0.0_r326355.d065c811649-1
libdrm-git 2.4.99.r17.g10cd9c3d-1
lib32-libdrm-git 2.4.99.r17.g10cd9c3d-1
Comment 33 Pierre-Eric Pelloux-Prayer 2019-09-10 15:23:51 UTC
Created attachment 145323
wip patch

You can give the attached kernel patch a try; it will hopefully prevent some sdma timeouts.

I'm still testing it but the more testers the better :)
Comment 34 Mathieu Belanger 2019-09-10 15:36:52 UTC
Patch applied

Removed nodma from the /etc/environment

Will reboot at lunch time. Usually my IDEs trigger the crash. Will see how it goes.
Comment 35 Alexandr Kára 2019-09-10 18:25:07 UTC
Created attachment 145324
UMR dump of registers on a GPU lockup

Sending dmesg output + UMR registers dump of both sdma0 and sdma1 for a lockup in Rise of the Tomb Raider.

[drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=104586, emitted seq=104588
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RiseOfTheTombRa pid 8457 thread RiseOfTheT:cs0 pid 8463
[drm] GPU recovery disabled.

The lockup is reproducible and only affects the GPU - it's still fine to ssh to the machine and it's otherwise working fine.
Comment 36 Sebastian Meyer 2019-09-10 21:02:44 UTC
Created attachment 145326
umr output of sdma0/sdma1 after RotTR freeze

Applied the provided WIP patch to linux-mainline 5.3-rc8 and started RotTR again in order to trigger a system freeze.
This time I also got a ring sdma0 and sdma1 timeout:

[  632.175837] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[  632.175973] [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
[  637.299049] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=313757, emitted seq=313759
[  637.299110] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RiseOfTheTombRa pid 2584 thread RiseOfTheT:cs0 pid 2590
[  637.299111] [drm] GPU recovery disabled.
[  646.468871] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=278259, emitted seq=278263
[  646.468961] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=21116, emitted seq=21119
[  646.469052] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[  646.469141] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process plasmashell pid 989 thread plasmashel:cs0 pid 1155
[  646.469141] [drm] GPU recovery disabled.
[  646.469142] [drm] GPU recovery disabled.

Stdout of `umr -R sdma0` and `umr -R sdma1` is attached to this post, however, I also got a couple of stderr messages like "[ERROR]: No valid mapping for 3@800000023f00" which I didn't include in the output.
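
For anyone comparing lockups across reports, here is a minimal sketch that pulls the ring name and fence sequence numbers out of the `amdgpu_job_timedout` lines above. The function name and the `pending` value (emitted minus signaled, i.e. how many fences were submitted but never completed) are my own illustration, not part of any driver tooling, and the regex only assumes the line format seen in these logs.

```python
import re

# Matches the "ring ... timeout" line format seen in the dmesg output above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(log_text):
    """Return a list of (ring, signaled, emitted, pending) tuples."""
    results = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            signaled = int(match.group("signaled"))
            emitted = int(match.group("emitted"))
            # pending = fences emitted to the ring but not yet signaled
            results.append((match.group("ring"), signaled, emitted, emitted - signaled))
    return results


# Three lines copied from the log in this comment.
SAMPLE = """\
[  637.299049] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=313757, emitted seq=313759
[  646.468871] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=278259, emitted seq=278263
[  646.468961] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=21116, emitted seq=21119
"""

if __name__ == "__main__":
    for ring, signaled, emitted, pending in parse_ring_timeouts(SAMPLE):
        print(f"{ring}: {pending} fence(s) emitted but not signaled")
```

Because the regex uses `search` rather than anchoring on the timestamp, the same function should also work on journalctl-style lines, where the line starts with a date and hostname instead of a bracketed uptime.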
Comment 37 Jeremy Silliman 2019-09-12 12:21:35 UTC
I purchased a 5700XT the other day, and what I've noticed is that anything that tries getting statistics from the GPU (radeontop, lm_sensors) induces a page fault hang within a couple of minutes. In my testing I either ran lm_sensors every three seconds, or radeontop, and left it idle while playing a game or watching a video, and without fail, a hang would happen shortly after. As soon as I stopped running either of those programs the hangs stopped. This may work as a reproducible test case for some of the hangs.
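
A rough sketch of the polling loop described above, for anyone who wants to try reproducing this without lm_sensors or radeontop. It simply re-reads hwmon sensor files on a fixed interval. The sysfs path is an assumption (the card index and which sensor files exist vary per system), and `poll_sensors` is a hypothetical helper name. Whether plain file reads trigger the same hang as those tools is untested.

```python
import glob
import time

# Assumed location: amdgpu exposes hwmon sensors under the DRM card's device
# directory. "card0" and "temp1_input" may differ on your machine.
DEFAULT_PATTERN = "/sys/class/drm/card0/device/hwmon/hwmon*/temp1_input"


def poll_sensors(pattern=DEFAULT_PATTERN, interval=3.0, iterations=5):
    """Read every file matching `pattern` once per interval.

    Returns a list of (path, value) tuples, one per successful or failed read.
    """
    readings = []
    for i in range(iterations):
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as sensor:
                    readings.append((path, sensor.read().strip()))
            except OSError as exc:
                # A failing read is worth recording when hunting for a hang.
                readings.append((path, f"error: {exc}"))
        if i + 1 < iterations:
            time.sleep(interval)
    return readings
```

Calling `poll_sensors(interval=3.0, iterations=100)` while a game or video is running would approximate the three-second polling from the test above.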
Comment 38 Mathieu Belanger 2019-09-13 05:22:47 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #33)
> Created attachment 145323 [details] [review]
> wip patch
> 
> You can give a try to the attached kernel patch which hopefully could
> prevent some sdma timeouts.
> 
> I'm still testing it but the more testers the better :)

So far so good. Your patch seems to have fixed the "random" crash that I could reproduce by loading my three many-tab browsers and PhpStorm at the same time, and I can use my IDE without crashing too.

Maybe I just got really lucky, but it's been more than a day without a crash and without the nodma "fix".
Comment 39 Shmerl 2019-09-15 02:41:08 UTC
I also get such freezes when opening a new tab in Firefox (once in a while), and when using ksysguard to read amdgpu sensors with Sapphire Pulse RX 5700 XT. I'm going to try this patch.
Comment 40 Shmerl 2019-09-15 07:52:39 UTC
With that patch, I get stutters but no hard freeze when using ksysguard to read amdgpu sensors. I see these errors in dmesg when that happens:

[14889.400985] amdgpu: [powerplay] Failed to export SMU metrics table!
[14890.311391] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80
[14891.933714] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80
[14892.785612] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2 param 0x80
[14892.785615] amdgpu: [powerplay] Failed to export SMU metrics table!
[14894.406389] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2 param 0x80
[14894.406393] amdgpu: [powerplay] Failed to export SMU metrics table!
[14895.261140] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80
[14896.937622] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80
[14897.734712] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2 param 0x80
[14897.734714] amdgpu: [powerplay] Failed to export SMU metrics table!
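
To compare how often each powerplay error shows up between runs (for example before and after a firmware change), a small tally like the following may help. It is only a sketch: `tally_powerplay_errors` is a made-up name, and grouping messages by the text before the first hex value is my own choice.

```python
import re
from collections import Counter

# Captures whatever follows the "[powerplay]" tag on an amdgpu log line.
POWERPLAY_RE = re.compile(r"amdgpu: \[powerplay\] (.+)")


def tally_powerplay_errors(log_text):
    """Count powerplay messages, ignoring the hex values and trailing '!'."""
    counts = Counter()
    for line in log_text.splitlines():
        match = POWERPLAY_RE.search(line)
        if match:
            # Keep only the text before the first hex argument.
            key = match.group(1).split(" 0x")[0].rstrip("!")
            counts[key] += 1
    return counts


# Three lines copied from the dmesg output above.
SAMPLE = """\
[14890.311391] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80
[14892.785612] amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2 param 0x80
[14892.785615] amdgpu: [powerplay] Failed to export SMU metrics table!
"""

if __name__ == "__main__":
    for message, count in tally_powerplay_errors(SAMPLE).items():
        print(f"{count:3d}  {message}")
```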
Comment 41 Shmerl 2019-09-15 17:45:47 UTC
Just FYI, I just tried the latest firmware from here (2019-09-13): https://people.freedesktop.org/~agd5f/radeon_ucode/navi10/

It didn't make a difference, ksysguard is still causing those powerplay errors.
Comment 42 Marko Popovic 2019-09-15 19:57:58 UTC
Ok I came home from vacation and got my hands on the WIP patch. 

Rocket-League startup SDMA-type freeze is completely gone.

I will continue testing the desktop usage without nodma enabled and will report if it fixes the random SDMA freezes as well :)

Will keep you guys updated.
Comment 43 Marko Popovic 2019-09-15 20:37:11 UTC
(In reply to Marko Popovic from comment #42)
> Ok I came home from vacation and got my hands on the WIP patch. 
> 
> Rocket-League startup SDMA-type freeze is completely gone.
> 
> I will continue testing the desktop usage without nodma enabled and will
> report if it fixes the random SDMA freezes as well :)
> 
> Will keep you guys updated.

Update: OK, never mind, I spoke too soon. The RL SDMA freezes came back even with the WIP patch applied. Here is the output:

sep 15 22:34:15 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=4302, emitted seq=4304
sep 15 22:34:15 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process RocketLeague pid 3123 thread RocketLeag:cs0 pid 3130
Comment 44 Marko Popovic 2019-09-15 22:22:43 UTC
Another Update:

Unfortunately, even with the WIP patch applied, I got another random SDMA-type desktop freeze in GNOME Shell.

I re-enabled the nodma tweak to avoid these until further fixes are found.

Another note: I use the most recent firmware libraries and there don't seem to be any improvements on the sdma freezes front.
Comment 45 Mathieu Belanger 2019-09-16 05:31:34 UTC
Just an update: still no new "random" crash since the patch was applied.

The only crashes I've had since applying the patch were partial, recoverable ones caused by insufficient voltage to the overclocked CPU.

So the WIP patch did fix some of the crashes in this bug report, but not all.
Comment 46 Marko Popovic 2019-09-16 06:47:05 UTC
(In reply to Mathieu Belanger from comment #45)
> Just an update: still no new "random" crash since the patch was applied.
> 
> The only crashes I've had since applying the patch were partial, recoverable
> ones caused by insufficient voltage to the overclocked CPU.
> 
> So the WIP patch did fix some of the crashes in this bug report, but not all.

Unfortunately I wasn't so lucky: I got both random and provoked sdma freezes soon after disabling the nodma variable :(
Comment 47 Mathieu Belanger 2019-09-16 18:16:25 UTC
Nah, random crashes still occur with FileZilla, so they're not totally gone for me. I put nodma back because I use that system for work.
Comment 48 Marko Popovic 2019-09-17 10:23:23 UTC
(In reply to Pierre-Eric Pelloux-Prayer from comment #15)
> (In reply to Marko Popovic from comment #14)
> > 
> > Yes, always happens at the same place with Citra emulator
> 
> Could you capture a trace of the problem (using Apitrace or Renderdoc)?
> 
> This would be very helpful to fix it.

There is another type of freeze/hang happening when playing Starcraft II via D9VK. This one doesn't seem to be related to either ngg or dma because I have them both disabled and the hang occurs anyway.

sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out or interrupted!
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=2361623, emitted seq=2361625
sep 17 11:48:24 Marko-PC kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process SC2_x64.exe pid 20236 thread SC2_x64.exe pid 20236

Is there any way to apitrace the Vulkan API?
Comment 49 Shmerl 2019-09-17 21:24:53 UTC
Could be just a similar symptom, but I have a freeze with The Bard's Tale IV with the same error message: https://bugs.freedesktop.org/show_bug.cgi?id=111591

It's going through the radeonsi path though.

