Bug 103102

Summary:

Cannot wake-up with an AMD RX 480 on Linux 4.13 and Linux 4.14

Product:

DRI

Reporter:

Hadrien Nilsson <freedesktop>

Component:

DRM/AMDgpu

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED FIXED

QA Contact:

Severity:

normal

Priority:

medium

CC:

radeon.20.mathieui

Version:

unspecified

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
kern.log of boot/suspend/resume with Linux kernel 4.12.5	none
kern.log of boot/suspend/resume with Linux kernel 4.12.6	none

Description Hadrien Nilsson 2017-10-04 18:52:34 UTC

I used to suspend/resume my system on Ubuntu 17.04 and it was fine. I installed Ubuntu 17.10, which comes with Linux 4.13, and now my system cannot wake up anymore.

I submitted the bug on Ubuntu's web site. I tried Linux 4.14 but the problem was still present. I was then told to open a bug here.

The original bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1720622

/var/log/kern.log shows several errors regarding amdgpu.

My RX 480 is connected to one single screen with DisplayPort. The problem occurs with both wayland and xorg.

Comment 1 Hadrien Nilsson 2017-10-05 18:34:35 UTC

Extract of /var/log/kern.log when I did the suspend/resume:

Oct 3 21:03:28 c18 kernel: [ 62.519787] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 14 test failed
Oct 3 21:03:28 c18 kernel: [ 62.519795] [drm:amdgpu_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
Oct 3 21:03:28 c18 kernel: [ 62.519803] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
Oct 3 21:03:28 c18 kernel: [ 62.519806] dpm_run_callback(): pci_pm_resume+0x0/0xb0 returns -110
Oct 3 21:03:28 c18 kernel: [ 62.519806] PM: Device 0000:01:00.0 failed to resume async: error -110

...

Oct 3 21:04:21 c18 kernel: [ 115.155901] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
Oct 3 21:04:21 c18 kernel: [ 115.155912] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing BFFC (len 116, WS 0, PS 0) @ 0xC049
Oct 3 21:04:21 c18 kernel: [ 115.155955] amdgpu 0000:01:00.0: ffff9ffe950e2800 unpin not necessary
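
Editorial note: the error code -110 in these lines corresponds to ETIMEDOUT in the standard Linux errno numbering, i.e. the VCE block timed out during resume. A minimal sketch for pulling the amdgpu *ERROR* lines out of a kern.log and decoding that code follows. The names LINUX_ERRNO and amdgpu_errors are illustrative and not part of any tool, and the regular expressions are assumptions based only on the line format quoted above.

```python
import re

# Assumption: standard Linux errno numbering; only the value seen in this log.
LINUX_ERRNO = {110: "ETIMEDOUT"}

# Matches e.g. "[ 62.519795] [drm:amdgpu_resume_phase2 [amdgpu]] *ERROR* ..."
ERROR_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:(?P<func>\w+) \[amdgpu\]\] \*ERROR\* (?P<msg>.*)"
)
# A trailing negative code such as "failed -110" or "failed (-110)."
CODE_RE = re.compile(r"-(\d+)\)?\.?$")


def amdgpu_errors(lines):
    """Yield (timestamp, function, message, errno_name) per amdgpu *ERROR* line."""
    for line in lines:
        m = ERROR_RE.search(line)
        if not m:
            continue  # not an amdgpu DRM error line
        msg = m.group("msg").strip()
        code = CODE_RE.search(msg)
        name = LINUX_ERRNO.get(int(code.group(1))) if code else None
        yield float(m.group("ts")), m.group("func"), msg, name
```

Feeding it the lines of /var/log/kern.log (for example via open(path)) would list each failing function in timestamp order, which makes it easier to compare the failure sequence between kernel versions.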

Comment 2 Hadrien Nilsson 2017-10-05 19:11:11 UTC

I tried several kernels:

4.12.4: OK
4.12.5: OK
4.12.6: FAIL, but after 20 seconds the monitor displays something: pure garbage
4.12.7: FAIL, the monitor stays OFF

I'll attach the kernel logs.

Interesting part in 4.12.6:

Oct  5 20:56:21 c18 kernel: [   37.277968] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 13 test failed
Oct  5 20:56:21 c18 kernel: [   37.277976] [drm:amdgpu_resume [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
Oct  5 20:56:21 c18 kernel: [   37.277983] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
Oct  5 20:56:21 c18 kernel: [   37.277985] dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -110
Oct  5 20:56:21 c18 kernel: [   37.277986] PM: Device 0000:01:00.0 failed to resume async: error -110

There is a slight difference with more recent kernels, where the message mentions ring 14 instead of ring 13 and some other functions too.

Comment 3 Hadrien Nilsson 2017-10-05 19:12:28 UTC

Created attachment 134688 [details]
kern.log of boot/suspend/resume with Linux kernel 4.12.5

Comment 4 Hadrien Nilsson 2017-10-05 19:12:45 UTC

Created attachment 134689 [details]
kern.log of boot/suspend/resume with Linux kernel 4.12.6

Comment 5 Hadrien Nilsson 2017-10-05 19:25:08 UTC

The problem looks a bit like this one: https://bugzilla.kernel.org/show_bug.cgi?id=196615

Comment 6 Hadrien Nilsson 2017-10-09 19:47:51 UTC

I tried the Phoronix kernel image from Alex Deucher's drm-next-4.15-dc Git branch. I still get the same problem with and without amdgpu.dc=1.

Oct  9 21:29:12 c18 kernel: [   32.199451] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 14 test failed
Oct  9 21:29:12 c18 kernel: [   32.199461] [drm:amdgpu_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
Oct  9 21:29:12 c18 kernel: [   32.199471] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
Oct  9 21:29:12 c18 kernel: [   32.199474] dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -110
Oct  9 21:29:12 c18 kernel: [   32.199474] PM: Device 0000:01:00.0 failed to resume async: error -110
Oct  9 21:29:12 c18 kernel: [   32.199498] PM: resume of devices complete after 539.171 msecs

I've been looking at the changelog of kernel 4.12.6 but there aren't many changes in amdgpu. Maybe changes in another subsystem are affecting amdgpu? I get these errors before the amdgpu errors:

Oct  9 21:29:12 c18 kernel: [   32.035186] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20170531/psargs-364)
Oct  9 21:29:12 c18 kernel: [   32.035188] ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT0._GTF, AE_NOT_FOUND (20170531/psparse-550)
Oct  9 21:29:12 c18 kernel: [   32.035206] ata6: SATA link down (SStatus 4 SControl 300)
Oct  9 21:29:12 c18 kernel: [   32.035225] ata2: SATA link down (SStatus 4 SControl 300)
Oct  9 21:29:12 c18 kernel: [   32.035231] ata1.00: supports DRM functions and may not be fully accessible
Oct  9 21:29:12 c18 kernel: [   32.035901] ata1.00: disabling queued TRIM support
Oct  9 21:29:12 c18 kernel: [   32.037457] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20170531/psargs-364)
Oct  9 21:29:12 c18 kernel: [   32.037459] ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT0._GTF, AE_NOT_FOUND (20170531/psparse-550)
Oct  9 21:29:12 c18 kernel: [   32.037531] ata1.00: supports DRM functions and may not be fully accessible
Oct  9 21:29:12 c18 kernel: [   32.038084] ata1.00: disabling queued TRIM support
Oct  9 21:29:12 c18 kernel: [   32.038704] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20170531/psargs-364)
Oct  9 21:29:12 c18 kernel: [   32.038706] ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT4._GTF, AE_NOT_FOUND (20170531/psparse-550)
Oct  9 21:29:12 c18 kernel: [   32.039418] ata1.00: configured for UDMA/133
Oct  9 21:29:12 c18 kernel: [   32.055252] ata4: SATA link down (SStatus 4 SControl 300)
Oct  9 21:29:12 c18 kernel: [   32.082715] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20170531/psargs-364)
Oct  9 21:29:12 c18 kernel: [   32.082718] ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT4._GTF, AE_NOT_FOUND (20170531/psparse-550)
Oct  9 21:29:12 c18 kernel: [   32.114002] ata5.00: configured for UDMA/133

Comment 7 Michel Dänzer 2017-10-11 17:00:42 UTC

Is there any chance you can bisect between 4.12.5 and 4.12.6?
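
Editorial note: bisecting means doing a binary search over the commits between the last known-good release and the first known-bad one, so only about log2(N) kernels need to be built and tested. A minimal sketch of the idea, under stated assumptions: first_bad and the is_bad callback are hypothetical names, and is_bad stands for "build this commit, boot it, suspend, and check whether resume fails". In practice, git bisect automates this bookkeeping.

```python
def first_bad(commits, is_bad):
    """Return the index of the first bad commit in an ordered list.

    Assumes commits[0] is known good, commits[-1] is known bad,
    and every commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return bad
```

Each iteration halves the remaining range, which is why bisecting even a few hundred commits only takes a handful of build-and-test cycles.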

Comment 8 Alex Deucher 2017-10-11 17:14:19 UTC

Does the patch on this bug help?
https://bugzilla.kernel.org/show_bug.cgi?id=196615

Comment 9 Hadrien Nilsson 2017-10-11 19:54:31 UTC

(In reply to Michel Dänzer from comment #7)
> Is there any chance you can bisect between 4.12.5 and 4.12.6?

I think this is the next step to diagnose that problem. I've never built and installed my own kernel before so I will have to learn how to do it.

(In reply to Alex Deucher from comment #8)
> Does the patch on this bug help?
> https://bugzilla.kernel.org/show_bug.cgi?id=196615

It was my first clue, as the kernel log looks similar, but the problematic commit identified in that bug ticket is present in 4.12.4 and 4.12.5, versions that work on my computer. Once I learn how to build my own kernel I will try removing the commit just to be sure.

Comment 10 Hadrien Nilsson 2017-10-13 20:11:51 UTC

I built v4.12.5 from git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git instead of using a prebuilt kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12.5/ and I unexpectedly faced the problem.

I wonder why I cannot reproduce the problem with Ubuntu's version of kernel 4.12.5. Maybe there are some Ubuntu specific patches that hide the problem?

Anyway, I rebuilt v4.12.5 again but this time I removed the code from the commit referenced on kernel.org Bugzilla:

------------------------------------------------------------------------------------------------------
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
index c0a806280257..f862e3d9cd93 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
@@ -838,9 +838,10 @@ static int amdgpu_cgs_get_active_displays_info(struct cgs_device *cgs_device,
                return -EINVAL;
 
        mode_info = info->mode_info;
+
        if (mode_info) {
                /* if the displays are off, vblank time is max */
-               mode_info->vblank_time_us = 0xffffffff;
+               /*mode_info->vblank_time_us = 0xffffffff;*/
                /* always set the reference clock */
                mode_info->ref_clock = adev->clock.spll.reference_freq;
        }

------------------------------------------------------------------------------------------------------

And I cannot reproduce the problem anymore. So I guess this is actually the same problem as https://bugzilla.kernel.org/show_bug.cgi?id=196615

Comment 11 Hadrien Nilsson 2017-10-14 16:57:55 UTC

I tried setting vblank_time_us to 0x7fffffff, 0x0000ffff, and 0x000000ff, and I get the same problem each time.

Comment 12 Hadrien Nilsson 2017-10-21 08:05:28 UTC

Alex, I applied the changes you attached at kernel.org regarding /drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c to my 4.12.5 kernel (and I kept the line that sets vblank_time_us to 0xffffffff).

It fixes the issue. I suspended and resumed my system twice without problem. Good catch!

Comment 13 Hadrien Nilsson 2017-12-07 22:08:34 UTC

Today I got a new kernel for Ubuntu 17.10: 4.13.0-19-generic #22-Ubuntu

Everything works fine. I didn't test it on 4.14 but it seems the fix was ported there too, so if everybody is OK with it I will close this bug.

Comment 14 Alex Deucher 2017-12-08 13:50:32 UTC

Fixed in 8b95f4f730cba02ef6febbdc4ca7e55ca045b00e
