I used to suspend and resume my system on Ubuntu 17.04 without problems. After installing Ubuntu 17.10, which ships with Linux 4.13, the system can no longer wake up. I submitted the bug on Ubuntu's web site and tried Linux 4.14, but the problem was still present. I was then told to open a bug here.

The original bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1720622

/var/log/kern.log shows several errors regarding amdgpu. My RX 480 is connected to a single screen via DisplayPort. The problem occurs with both Wayland and Xorg.
Extract of /var/log/kern.log when I did the suspend/resume:

Oct 3 21:03:28 c18 kernel: [ 62.519787] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 14 test failed
Oct 3 21:03:28 c18 kernel: [ 62.519795] [drm:amdgpu_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
Oct 3 21:03:28 c18 kernel: [ 62.519803] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
Oct 3 21:03:28 c18 kernel: [ 62.519806] dpm_run_callback(): pci_pm_resume+0x0/0xb0 returns -110
Oct 3 21:03:28 c18 kernel: [ 62.519806] PM: Device 0000:01:00.0 failed to resume async: error -110
...
Oct 3 21:04:21 c18 kernel: [ 115.155901] [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 5secs aborting
Oct 3 21:04:21 c18 kernel: [ 115.155912] [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing BFFC (len 116, WS 0, PS 0) @ 0xC049
Oct 3 21:04:21 c18 kernel: [ 115.155955] amdgpu 0000:01:00.0: ffff9ffe950e2800 unpin not necessary
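For anyone comparing logs across kernels, the amdgpu failures can be pulled out of kern.log with a small filter. This is only a sketch: the function name amdgpu_errors and the ERRNO_NAMES table are made up for illustration, and the regular expressions assume the line layout seen in the extract above. The one fact relied on is that errno 110 on Linux is ETIMEDOUT, so "-110" means the resume step timed out.

```python
import re

# Matches the driver part of lines such as:
#   [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 14 test failed
DRM_ERROR = re.compile(r"\[drm:(?P<func>\w+) \[amdgpu\]\] \*ERROR\* (?P<msg>.*)$")

# Trailing negative status code, e.g. "failed -110" or "failed (-110)."
STATUS = re.compile(r"\(?-(?P<code>\d+)\)?\.?$")

# Hypothetical lookup table holding only the code seen in this report.
# On Linux, errno 110 is ETIMEDOUT.
ERRNO_NAMES = {110: "ETIMEDOUT"}


def amdgpu_errors(lines):
    """Yield (function, message, errno name or None) per amdgpu *ERROR* line."""
    for line in lines:
        match = DRM_ERROR.search(line.rstrip())
        if not match:
            continue  # not an amdgpu DRM error line
        status = STATUS.search(match.group("msg"))
        name = ERRNO_NAMES.get(int(status.group("code"))) if status else None
        yield match.group("func"), match.group("msg"), name


if __name__ == "__main__":
    # The path is the one mentioned in this report; adjust as needed.
    with open("/var/log/kern.log", errors="replace") as log:
        for func, msg, name in amdgpu_errors(log):
            print(func, "|", msg, "|", name or "")
```

Running it on the logs of a good and a bad kernel and comparing the output makes it easier to see which function reports the first failure.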
I tried several kernels:

4.12.4: OK
4.12.5: OK
4.12.6: FAIL, but after 20 seconds the monitor displays something: pure garbage
4.12.7: FAIL, the monitor stays OFF

I'll attach the kernel logs.

Interesting part in 4.12.6:

Oct 5 20:56:21 c18 kernel: [ 37.277968] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 13 test failed
Oct 5 20:56:21 c18 kernel: [ 37.277976] [drm:amdgpu_resume [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
Oct 5 20:56:21 c18 kernel: [ 37.277983] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
Oct 5 20:56:21 c18 kernel: [ 37.277985] dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -110
Oct 5 20:56:21 c18 kernel: [ 37.277986] PM: Device 0000:01:00.0 failed to resume async: error -110

There is a slight difference with more recent kernels, where the message mentions ring 14 instead of ring 13 and some function names differ (for example amdgpu_resume_phase2 instead of amdgpu_resume).
Created attachment 134688: kern.log of boot/suspend/resume with Linux kernel 4.12.5
Created attachment 134689: kern.log of boot/suspend/resume with Linux kernel 4.12.6
The problem looks a bit like this one: https://bugzilla.kernel.org/show_bug.cgi?id=196615
I tried the Phoronix kernel image built from Alex Deucher's drm-next-4.15-dc Git branch. I still get the same problem with and without amdgpu.dc=1.

Oct 9 21:29:12 c18 kernel: [ 32.199451] [drm:amdgpu_vce_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 14 test failed
Oct 9 21:29:12 c18 kernel: [ 32.199461] [drm:amdgpu_resume_phase2 [amdgpu]] *ERROR* resume of IP block <vce_v3_0> failed -110
Oct 9 21:29:12 c18 kernel: [ 32.199471] [drm:amdgpu_device_resume [amdgpu]] *ERROR* amdgpu_resume failed (-110).
Oct 9 21:29:12 c18 kernel: [ 32.199474] dpm_run_callback(): pci_pm_resume+0x0/0xa0 returns -110
Oct 9 21:29:12 c18 kernel: [ 32.199474] PM: Device 0000:01:00.0 failed to resume async: error -110
Oct 9 21:29:12 c18 kernel: [ 32.199498] PM: resume of devices complete after 539.171 msecs

I've been looking at the changelog of kernel 4.12.6, but there aren't many changes in amdgpu. Maybe there are changes elsewhere that affect amdgpu? These errors appear before the amdgpu errors:

Oct 9 21:29:12 c18 kernel: [ 32.035186] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20170531/psargs-364)
Oct 9 21:29:12 c18 kernel: [ 32.035188] ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT0._GTF, AE_NOT_FOUND (20170531/psparse-550)
Oct 9 21:29:12 c18 kernel: [ 32.035206] ata6: SATA link down (SStatus 4 SControl 300)
Oct 9 21:29:12 c18 kernel: [ 32.035225] ata2: SATA link down (SStatus 4 SControl 300)
Oct 9 21:29:12 c18 kernel: [ 32.035231] ata1.00: supports DRM functions and may not be fully accessible
Oct 9 21:29:12 c18 kernel: [ 32.035901] ata1.00: disabling queued TRIM support
Oct 9 21:29:12 c18 kernel: [ 32.037457] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20170531/psargs-364)
Oct 9 21:29:12 c18 kernel: [ 32.037459] ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT0._GTF, AE_NOT_FOUND (20170531/psparse-550)
Oct 9 21:29:12 c18 kernel: [ 32.037531] ata1.00: supports DRM functions and may not be fully accessible
Oct 9 21:29:12 c18 kernel: [ 32.038084] ata1.00: disabling queued TRIM support
Oct 9 21:29:12 c18 kernel: [ 32.038704] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20170531/psargs-364)
Oct 9 21:29:12 c18 kernel: [ 32.038706] ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT4._GTF, AE_NOT_FOUND (20170531/psparse-550)
Oct 9 21:29:12 c18 kernel: [ 32.039418] ata1.00: configured for UDMA/133
Oct 9 21:29:12 c18 kernel: [ 32.055252] ata4: SATA link down (SStatus 4 SControl 300)
Oct 9 21:29:12 c18 kernel: [ 32.082715] ACPI Error: [DSSP] Namespace lookup failure, AE_NOT_FOUND (20170531/psargs-364)
Oct 9 21:29:12 c18 kernel: [ 32.082718] ACPI Error: Method parse/execution failed \_SB.PCI0.SAT0.PRT4._GTF, AE_NOT_FOUND (20170531/psparse-550)
Oct 9 21:29:12 c18 kernel: [ 32.114002] ata5.00: configured for UDMA/133
Is there any chance you can bisect between 4.12.5 and 4.12.6?
Does the patch on this bug help? https://bugzilla.kernel.org/show_bug.cgi?id=196615
(In reply to Michel Dänzer from comment #7)
> Is there any chance you can bisect between 4.12.5 and 4.12.6?

I think this is the next step in diagnosing the problem. I've never built and installed my own kernel before, so I will have to learn how to do it.

(In reply to Alex Deucher from comment #8)
> Does the patch on this bug help?
> https://bugzilla.kernel.org/show_bug.cgi?id=196615

That bug was my first clue, as the kernel log looks similar, but the problematic commit identified in that bug ticket is present in 4.12.4 and 4.12.5, versions that work on my computer. Once I learn how to build my own kernel, I will try removing the commit just to be sure.
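For readers who have not bisected before, the idea behind the request in comment #7 is a binary search over the commits between a known-good and a known-bad version, which is what the git bisect command automates. A minimal sketch follows, assuming every commit after the first bad one is also bad; first_bad and is_bad are hypothetical names, and in practice each is_bad check means building that kernel, booting it, and trying a suspend/resume cycle.

```python
def first_bad(commits, is_bad):
    """Return the index of the first bad commit in an ordered list.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the first bad one is bad as well.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return bad


if __name__ == "__main__":
    # Stand-in data: 100 numbered commits, the regression enters at number 37.
    commits = list(range(100))
    print(first_bad(commits, lambda c: c >= 37))
```

Each step halves the remaining range, so the number of kernels to build and test grows only with the logarithm of the number of commits between the two versions.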
I built v4.12.5 from git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git instead of using a prebuilt kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12.5/ and, unexpectedly, I hit the problem. I wonder why I cannot reproduce the problem with Ubuntu's build of kernel 4.12.5. Maybe there are some Ubuntu-specific patches that hide the problem?

Anyway, I rebuilt v4.12.5 again, but this time I removed the code from the commit referenced on kernel.org Bugzilla:

------------------------------------------------------------------------------------------------------
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
index c0a806280257..f862e3d9cd93 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cgs.c
@@ -838,9 +838,10 @@ static int amdgpu_cgs_get_active_displays_info(struct cgs_device *cgs_device,
 		return -EINVAL;
 
 	mode_info = info->mode_info;
+
 	if (mode_info) {
 		/* if the displays are off, vblank time is max */
-		mode_info->vblank_time_us = 0xffffffff;
+		/*mode_info->vblank_time_us = 0xffffffff;*/
 		/* always set the reference clock */
 		mode_info->ref_clock = adev->clock.spll.reference_freq;
 	}
------------------------------------------------------------------------------------------------------

And I cannot reproduce the problem anymore. So I guess this is actually the same problem as https://bugzilla.kernel.org/show_bug.cgi?id=196615
I tried setting vblank_time_us to 0x7fffffff, 0x0000ffff and 0x000000ff, and I get the same problem with each value.
Alex, I applied the changes you attached at kernel.org regarding /drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c to my 4.12.5 kernel (and I kept the line that sets vblank_time_us to 0xffffffff). It fixes the issue. I suspended and resumed my system twice without problem. Good catch!
Today I got a new kernel for Ubuntu 17.10: 4.13.0-19-generic #22-Ubuntu

Everything works fine. I didn't test it on 4.14, but it seems the fix was carried over there too, so if everybody is OK with it I will close this bug.
Fixed in 8b95f4f730cba02ef6febbdc4ca7e55ca045b00e