107154 – [drm] GPU recovery disabled.

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account

Reported:	2018-07-08 09:24 UTC by freedesktop.org
Modified:	2018-09-11 13:38 UTC
CC List:	0 users

Attachments
dmesg amdgpu.dc=1 (108.19 KB, text/plain) 2018-07-09 16:03 UTC, freedesktop.org
dmesg /etc/modprobe.d/ (136.06 KB, text/plain) 2018-07-09 16:04 UTC, freedesktop.org
dmesg 4.14 LTS (110.47 KB, text/plain) 2018-07-09 16:29 UTC, freedesktop.org

Description freedesktop.org 2018-07-08 09:24:30 UTC

Hi!

This is a surprisingly long-standing problem with an RX 460: it has been present since 4.15, all the way up to the 4.18 AMD staging DRM next [1].
After resuming from sleep (echo -n mem > /sys/power/state), amdgpu is dead, always and reliably.
Here's what dmesg has to say about it:

[Sun Jul  8 11:01:17 2018] PM: suspend exit
[Sun Jul  8 11:01:19 2018] [drm:gfx_v8_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
[Sun Jul  8 11:01:19 2018] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on GFX ring (-110).
[Sun Jul  8 11:01:19 2018] [drm:process_one_work] *ERROR* ib ring test failed (-110).
[Sun Jul  8 11:01:28 2018] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=864, last emitted seq=868
[Sun Jul  8 11:01:28 2018] [drm] GPU recovery disabled.

From earlier versions:

[   42.802559] PM: suspend exit
[   42.824332] amdgpu 0000:41:00.0: GPU fault detected: 147 0x0bd84802
[   42.824338] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0034F97B
[   42.824341] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C048002
[   42.824345] amdgpu 0000:41:00.0: VM fault (0x02, vmid 6) at page 3471739, read from 'TC0' (0x54433000) (72)
[   52.956306] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1287, last emitted seq=1289
[   52.956316] [drm] IP block:gfx_v8_0 is hung!
[   52.956362] [drm] GPU recovery disabled.

I've also seen fault 146 but other than that it mostly looks the same. 4.14-lts (with dc=0) works fine.
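
For triage, the gap between the last emitted and last signaled sequence numbers in the timeout lines above can be extracted mechanically. A minimal Python sketch, assuming the message format matches the two excerpts quoted in this report; the function name parse_ring_timeouts and the "pending" figure are made up for illustration and are not part of any amdgpu tooling:

```python
import re

# Matches the "ring ... timeout" message format seen in the excerpts above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(dmesg_text):
    """Return a list of (ring, signaled, emitted, pending) tuples.

    "pending" is simply emitted - signaled, i.e. how many sequence
    numbers had been emitted but not yet signaled when the timeout fired.
    """
    results = []
    for line in dmesg_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        results.append((match.group("ring"), signaled, emitted, emitted - signaled))
    return results
```

Feeding it the output of dmesg (for example via subprocess or a saved log file) gives one tuple per timeout, which makes it easier to compare runs across kernel versions.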

RX 460, Zenith Extreme, 1950x.

[1] Arch Linux AUR; this versioning is a bit confusing, it may actually already be the 4.19 branch. The latest commit is 3838e387fd1eb17bfcf6ff7d443d931adb5cb41b

Comment 1 dwagner 2018-07-08 19:00:26 UTC

Indeed, crashes upon S3 resumes have been abundant with amdgpu.dc=1 for many months now, and seemingly for more than one reason.

One bug I reported in August 2017 with https://bugs.freedesktop.org/show_bug.cgi?id=102323 - that one was fixed quickly.

The next S3 resume crash I reported in October 2017 in https://bugs.freedesktop.org/show_bug.cgi?id=103277, that one stayed without any resolution until April 2018, and the fix found in that report only works if no "drm.edid_firmware=..." kernel command line option is used.

Another crash bug with S3 resumes I reported for 4.17.2 kernels in https://bugs.freedesktop.org/show_bug.cgi?id=107065 - then realized that 4.18 pre-releases exhibit the very same kind of crash immediately upon starting X11. For this crash upon X11 startup, there is a patch in the bug report, but it does not prevent the S3 resume crash.

I currently work around S3 resume crashes by switching to the console display before entering S3 sleep - but this is really an awkward work-around.

Comment 2 freedesktop.org 2018-07-08 20:03:10 UTC

(In reply to dwagner from comment #1)
> I currently work around S3 resume crashes by switching to the console
> display before entering S3 sleep - but this is really an awkward work-around.

Oh, that doesn't help either. It crashes the very moment I switch back to X.

What's more, starting with 4.15, amdgpu.dc=0 doesn't appear to make any difference.

Comment 3 Michel Dänzer 2018-07-09 08:53:16 UTC

Please attach the full dmesg output.

Can you bisect between 4.14 and 4.15?

Comment 4 Christian König 2018-07-09 11:31:10 UTC

Do you have a full dmesg?

Comment 5 freedesktop.org 2018-07-09 16:03:40 UTC

Created attachment 140525
dmesg amdgpu.dc=1

Booted with amdgpu.dc=1.

Comment 6 freedesktop.org 2018-07-09 16:04:11 UTC

Created attachment 140526
dmesg /etc/modprobe.d/

Booted with amdgpu.dc=1 in /etc/modprobe.d/

Comment 7 freedesktop.org 2018-07-09 16:13:33 UTC

Sure, attached. AMD staging kernel. I don't know how to tell whether DC=1 is really enabled, so I did two runs: one with amdgpu.dc=1 as boot parameter and one with /etc/modprobe.d/ on top of that.
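
On the question of how to tell which dc value the module was loaded with: parameters that a kernel module exports are normally readable under /sys/module/<module>/parameters/. A small Python sketch, assuming amdgpu exposes its dc parameter there (believed to be the case, but treated as an assumption here); read_module_param is a hypothetical helper name:

```python
from pathlib import Path


def read_module_param(module, param, sysfs_root="/sys/module"):
    """Return the current value of a module parameter as a string.

    Returns None if the module is not loaded or the parameter is not
    exported through sysfs.
    """
    path = Path(sysfs_root) / module / "parameters" / param
    try:
        return path.read_text().strip()
    except OSError:
        return None


# Example: read_module_param("amdgpu", "dc")
```

The value shown is the setting the module was loaded with. It confirms that the boot parameter or modprobe.d option was picked up, but on its own it is not proof that the DC code path is actually in use for a given card.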

Procedure was the same both times:
- boot
- X login
- switch to console
- sleep, wakeup
- switch to X

The drm/amdgpu lines appear already in the console right after waking up, prior to switching to X.

This time "only" X crashed (could still move the pointer); at times the complete machine is dead, no switching to console and no SSH.

(As a side note: is it normal that waking up on Ryzen takes something on the order of 10-30s? I'm used to split-second wakeups on Intel.)

HTH

Comment 8 freedesktop.org 2018-07-09 16:29:34 UTC

Created attachment 140528
dmesg 4.14 LTS

Sorry, forgot about the requested 4.14 dmesg log. Attached as well.

This is: boot, login (to KDE this time), do stuff, remember, sleep, wakeup.

Comment 9 Christian König 2018-07-10 07:04:20 UTC

Yeah, that is a known problem in the PCI subsystem. Will be fixed with 4.19 and then backported to older kernels.

Comment 10 freedesktop.org 2018-09-02 10:26:59 UTC

So, there's 4.19rc1-amd-next \o/

echo: write error: Device or resource busy

This started to happen with 4.18. dmesg:

[  171.245467] Freezing of tasks failed after 20.006 seconds (1 tasks refusing to freeze, wq_busy=0):
[  171.245484] systemd-udevd   D    0   700    615 0x80000124

So, is this something to report to fricking systemd?

Gee, really...?!

Comment 11 Kyle De'Vir 2018-09-11 13:38:32 UTC

> systemd-udevd

This is not systemd's fault, but indicative of something hanging in kernel land, which udevd ends up being blocked on.

I experienced this a few major kernel releases ago; it was resolved by the next major version. Never did figure out what caused udevd to block... :/
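
To see which processes are stuck in uninterruptible sleep (state D, as reported for systemd-udevd above) before attempting a suspend, /proc can be scanned directly. A rough Python sketch; parse_stat and uninterruptible_tasks are illustrative names, and the scan covers processes only, not individual threads:

```python
import os


def parse_stat(text):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so use the first '(' and the last ')' as delimiters.
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen].strip())
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the stat line was unreadable
        if state == "D":
            found.append((pid, comm))
    return found
```

For a process that shows up here, /proc/<pid>/stack (readable by root, and only on kernels built with stack trace support) may show where in the kernel it is blocked.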
