Bug 106447 - System freeze after resuming from suspend (amdgpu)
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2018-05-09 06:04 UTC by Thomas Martitz
Modified: 2019-11-19 08:38 UTC


Attachments
dmesg (actually journalctl -k) output before suspend (84.33 KB, text/plain)
2018-05-09 11:55 UTC, Thomas Martitz
dmesg of 4.17-rc4 before suspend (83.72 KB, text/plain)
2018-05-09 15:25 UTC, Thomas Martitz
dmesg of last good commit after suspend (94.38 KB, text/plain)
2018-05-09 21:13 UTC, Thomas Martitz

Description Thomas Martitz 2018-05-09 06:04:24 UTC
Similar to #104649 but happening on amdgpu.

The system immediately locks up when resuming from suspend. I get to see the mouse cursor and the blue background of KDE's screen lock (but no password entry or anything like that), but cannot do anything.

I can also reproduce on 4.16.7 and 4.17-rc4. This does not happen with amdgpu blacklisted, or with Arch Linux' LTS kernel (4.14.39) though I get other random failures on the LTS kernel.

Unfortunately, the systemd journal contains nothing after the point of entering suspend, so I have no way to obtain a backtrace.

System:
Arch Linux w/ Linux 4.16.7
DMI: HP ZBook 14u G5/83B2, BIOS Q78 Ver. 01.00.05 01/25/2018
Intel Kaby Lake Refresh (Core i7-8550U)
Intel UHD 620
AMD Radeon PRO WX 3100 (I believe this is Polaris; I am not sure about the exact generation)

lspci:
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers (rev 08)
00:02.0 VGA compatible controller: Intel Corporation UHD Graphics 620 (rev 07)
00:04.0 Signal processing controller: Intel Corporation Xeon E3-1200 v5/E3-1500 v5/6th Gen Core Processor Thermal Subsystem (rev 08)
00:14.0 USB controller: Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller (rev 21)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-LP Thermal subsystem (rev 21)
00:15.0 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #0 (rev 21)
00:15.1 Signal processing controller: Intel Corporation Sunrise Point-LP Serial IO I2C Controller #1 (rev 21)
00:16.0 Communication controller: Intel Corporation Sunrise Point-LP CSME HECI #1 (rev 21)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #1 (rev f1)
00:1c.3 PCI bridge: Intel Corporation Device 9d13 (rev f1)
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #5 (rev f1)
00:1d.0 PCI bridge: Intel Corporation Sunrise Point-LP PCI Express Root Port #9 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Device 9d4e (rev 21)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-LP PMC (rev 21)
00:1f.3 Audio device: Intel Corporation Sunrise Point-LP HD Audio (rev 21)
00:1f.4 SMBus: Intel Corporation Sunrise Point-LP SMBus (rev 21)
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (4) I219-V (rev 21)
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Lexa XT [Radeon PRO WX 3100]
02:00.0 Network controller: Intel Corporation Wireless 8265 / 8275 (rev 78)
3c:00.0 Non-Volatile memory controller: Toshiba America Info Systems Device 0116
Comment 1 Michel Dänzer 2018-05-09 07:44:17 UTC
Please attach the dmesg output captured before suspend.
Comment 2 Thomas Martitz 2018-05-09 11:55:09 UTC
Created attachment 139444
dmesg (actually journalctl -k) output before suspend
Comment 3 Michel Dänzer 2018-05-09 13:48:33 UTC
Does it also happen with amdgpu.dc=0?
Comment 4 Thomas Martitz 2018-05-09 15:23:36 UTC
Yes, on both 4.16.7 and 4.17-rc4
Comment 5 Thomas Martitz 2018-05-09 15:25:44 UTC
Created attachment 139445
dmesg of 4.17-rc4 before suspend
Comment 6 Thomas Martitz 2018-05-09 15:26:25 UTC
The last dmesg output was captured with amdgpu.dc=0.
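
Before comparing suspend results, it can help to confirm that a parameter such as amdgpu.dc=0 really reached the kernel command line. The Python sketch below is not part of the original report. The helper name parse_module_params is hypothetical, and it only inspects /proc/cmdline, so parameters set through modprobe configuration files would not show up here.

```python
import os


def parse_module_params(cmdline: str, module: str) -> dict:
    """Return {param: value} for every `module.param=value` token in cmdline."""
    prefix = module + "."
    params = {}
    for token in cmdline.split():
        # Only tokens of the form "<module>.<name>=<value>" are of interest.
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    path = "/proc/cmdline"
    if os.path.exists(path):
        with open(path) as f:
            print(parse_module_params(f.read(), "amdgpu"))
```

If the result contains no "dc" entry, the parameter was not given at boot, and any dmesg captured in that session would not reflect the amdgpu.dc=0 case.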
Comment 7 Thomas Martitz 2018-05-09 21:13:03 UTC
I did a bisect and git reported this as the culprit:

kugel@thomas-nb:linux.git$ git bisect good
08810a4119aaebf6318f209ec5dd9828e969cba4 is the first bad commit
commit 08810a4119aaebf6318f209ec5dd9828e969cba4
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date:   Wed Oct 25 14:12:29 2017 +0200

    PM / core: Add NEVER_SKIP and SMART_PREPARE driver flags
    
    The motivation for this change is to provide a way to work around
    a problem with the direct-complete mechanism used for avoiding
    system suspend/resume handling for devices in runtime suspend.
    
    The problem is that some middle layer code (the PCI bus type and
    the ACPI PM domain in particular) returns positive values from its
    system suspend ->prepare callbacks regardless of whether the driver's
    ->prepare returns a positive value or 0, which effectively prevents
    drivers from being able to control the direct-complete feature.
    Some drivers need that control, however, and the PCI bus type has
    grown its own flag to deal with this issue, but since it is not
    limited to PCI, it is better to address it by adding driver flags at
    the core level.
    
    To that end, add a driver_flags field to struct dev_pm_info for flags
    that can be set by device drivers at the probe time to inform the PM
    core and/or bus types, PM domains and so on on the capabilities and/or
    preferences of device drivers.  Also add two static inline helpers
    for setting that field and testing it against a given set of flags
    and make the driver core clear it automatically on driver remove
    and probe failures.
    
    Define and document two PM driver flags related to the direct-
    complete feature: NEVER_SKIP and SMART_PREPARE that can be used,
    respectively, to indicate to the PM core that the direct-complete
    mechanism should never be used for the device and to inform the
    middle layer code (bus types, PM domains etc) that it can only
    request the PM core to use the direct-complete mechanism for
    the device (by returning a positive value from its ->prepare
    callback) if it also has been requested by the driver.
    
    While at it, make the core check pm_runtime_suspended() when
    setting power.direct_complete so that it doesn't need to be
    checked by ->prepare callbacks.
    
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Acked-by: Bjorn Helgaas <bhelgaas@google.com>
    Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>

:040000 040000 6f18a781ca7ee0501888a66532f0667f2926aeb1 440821a72777285dccc37d3a8254688bf4a24486 M      Documentation
:040000 040000 6aaceba7f5aae9368a1e6e287a1f56cb1326adbf 557c1672f5101aeae16ce6bda4969c42dd3321bb M      drivers
:040000 040000 bdc707f2a476baf517361c46ed28977cb30b6e1b 7c33fb89c953ad06a7b1c8b686d6b6a403aa509b M      include


(I haven't tried reverting just this on top of 4.16 yet).

Interestingly, this commit also seems to affect my WiFi: the good commits (from the suspend point of view) do not have working WiFi, while the bad commits do.

I'll attach the dmesg output from running on the last good commit.
Comment 8 Thomas Martitz 2018-05-09 21:13:46 UTC
Created attachment 139453
dmesg of last good commit after suspend
Comment 9 Thomas Martitz 2018-05-09 21:15:00 UTC
Here's the bisect log:

git bisect start
# bad: [75bc37fefc4471e718ba8e651aa74673d4e0a9eb] Linux 4.17-rc4
git bisect bad 75bc37fefc4471e718ba8e651aa74673d4e0a9eb
# good: [bebc6082da0a9f5d47a1ea2edc099bf671058bd4] Linux 4.14
git bisect good bebc6082da0a9f5d47a1ea2edc099bf671058bd4
# bad: [e4ee8b85b7657d9c769b727038faabdc2e6a3412] Merge tag 'usb-4.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
git bisect bad e4ee8b85b7657d9c769b727038faabdc2e6a3412
# bad: [bec04432cb9036dedf89140c102b5ac03e4b3626] Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux
git bisect bad bec04432cb9036dedf89140c102b5ac03e4b3626
# bad: [5bbcc0f595fadb4cac0eddc4401035ec0bd95b09] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad 5bbcc0f595fadb4cac0eddc4401035ec0bd95b09
# bad: [2cd83ba5bede2f72cc6c79a19a1bddf576b50e88] Merge tag 'iommu-v4.15-rc1' of git://github.com/awilliam/linux-vfio
git bisect bad 2cd83ba5bede2f72cc6c79a19a1bddf576b50e88
# bad: [449fcf3ab0baf3dde9952385e6789f2ca10c3980] Merge tag 'staging-4.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect bad 449fcf3ab0baf3dde9952385e6789f2ca10c3980
# good: [43ff2f4db9d0f76452b77cfa645f02b471143b24] Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 43ff2f4db9d0f76452b77cfa645f02b471143b24
# good: [43ff2f4db9d0f76452b77cfa645f02b471143b24] Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 43ff2f4db9d0f76452b77cfa645f02b471143b24
# good: [313144c1bcd6dd22f2375a602a8cb6efa759c8cd] Staging: rtlwifi: pci: fixed a coding style issue
git bisect good 313144c1bcd6dd22f2375a602a8cb6efa759c8cd
# good: [b18d62891aaff49d0ee8367d4b6bb9452469f807] Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good b18d62891aaff49d0ee8367d4b6bb9452469f807
# bad: [990a848d537e4da966907c8ccec95bc568f2911c] Merge branches 'pm-devfreq' and 'pm-tools'
git bisect bad 990a848d537e4da966907c8ccec95bc568f2911c
# good: [60af981c78a72255355c8e374e173b550d6742d6] Merge branch 'pm-cpufreq'
git bisect good 60af981c78a72255355c8e374e173b550d6742d6
# good: [05d658b5b57214944067fb4f62bce59200bf496f] Merge branch 'pm-sleep'
git bisect good 05d658b5b57214944067fb4f62bce59200bf496f
# bad: [1efef68262dc567f0c09da9d11924e8287cd3a8b] Merge branch 'pm-core'
git bisect bad 1efef68262dc567f0c09da9d11924e8287cd3a8b
# bad: [08810a4119aaebf6318f209ec5dd9828e969cba4] PM / core: Add NEVER_SKIP and SMART_PREPARE driver flags
git bisect bad 08810a4119aaebf6318f209ec5dd9828e969cba4
# good: [b082ddd8a6a3aa0399763bfb58fc7bdd84c95713] PM / core: Fix kerneldoc comments of four functions
git bisect good b082ddd8a6a3aa0399763bfb58fc7bdd84c95713
# good: [69a10ca747c2d2d7c0354a883335e097c067ed35] Merge branch 'acpi-pm' into pm-core
git bisect good 69a10ca747c2d2d7c0354a883335e097c067ed35
# first bad commit: [08810a4119aaebf6318f209ec5dd9828e969cba4] PM / core: Add NEVER_SKIP and SMART_PREPARE driver flags
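
A log like the one above is easier to review once the verdicts are tallied. The following Python sketch is an illustration only and was not used in this report. It assumes the comment-line layout shown above ("# good: [sha] subject", "# bad: [sha] subject", "# first bad commit: [sha] subject"), and the function name summarize_bisect_log is hypothetical.

```python
import re

# Matches the "# <verdict>: [<sha>] <subject>" comment lines of a bisect log.
VERDICT_RE = re.compile(
    r"^# (good|bad|first bad commit): \[([0-9a-f]{7,40})\] (.*)$"
)


def summarize_bisect_log(text: str) -> dict:
    """Collect good/bad verdicts and the first bad commit from a bisect log."""
    summary = {"good": [], "bad": [], "first_bad": None}
    for line in text.splitlines():
        match = VERDICT_RE.match(line.strip())
        if match is None:
            continue  # skip "git bisect ..." command lines and blank lines
        verdict, sha, subject = match.groups()
        if verdict == "first bad commit":
            summary["first_bad"] = (sha, subject)
        else:
            summary[verdict].append((sha, subject))
    return summary


if __name__ == "__main__":
    sample = "\n".join([
        "git bisect start",
        "# bad: [75bc37fefc4471e718ba8e651aa74673d4e0a9eb] Linux 4.17-rc4",
        "# good: [bebc6082da0a9f5d47a1ea2edc099bf671058bd4] Linux 4.14",
    ])
    result = summarize_bisect_log(sample)
    print(len(result["good"]), "good,", len(result["bad"]), "bad")
```

Repeated verdicts for the same commit are counted as they appear in the log; deduplicating them is left out to keep the sketch short.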
Comment 10 Michel Dänzer 2018-05-11 07:54:11 UTC
Looks like you should report this at https://bugzilla.kernel.org/enter_bug.cgi?product=Power%20Management&component=Hibernation/Suspend .
Comment 11 Thomas Martitz 2018-05-11 09:33:04 UTC
I can suspend+resume just fine with amdgpu blacklisted, so I'm under the impression that this is the right place.
Comment 12 Michel Dänzer 2018-05-11 10:15:56 UTC
That's debatable, given you bisected to a non-amdgpu commit, which affects WiFi as well.
Comment 13 Thomas Martitz 2018-05-11 10:26:00 UTC
I'll report the bug on the other site as well. 

In my view, loading the amdgpu module breaks resuming from suspend. Perhaps the module has not been adapted correctly to the earlier changes in the generic subsystems.
Comment 14 john-s-84 2018-05-11 13:56:06 UTC
Same problem here (HP ZBook 15u G5).

https://bugzilla.kernel.org/show_bug.cgi?id=199609

Chen Yu recommended sending a request to amd-gfx@lists.freedesktop.org; no success so far.

https://lists.freedesktop.org/archives/amd-gfx/2018-May/022064.html
Comment 15 Thomas Martitz 2018-05-11 15:14:25 UTC
I investigated the commit found by git bisect a bit more and found that the following patch, which reverts part of that commit, repairs resuming.

I can't judge the consequences, but the commit message suggests that this part is non-critical:

> While at it, make the core check pm_runtime_suspended() when
> setting power.direct_complete so that it doesn't need to be
> checked by ->prepare callbacks.

diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
index 02a497e7c785..028c14386e5d 100644
--- a/drivers/base/power/main.c
+++ b/drivers/base/power/main.c
@@ -1959,9 +1959,7 @@ static int device_prepare(struct device *dev, pm_message_t state)
         * applies to suspend transitions, however.
         */
        spin_lock_irq(&dev->power.lock);
-       dev->power.direct_complete = state.event == PM_EVENT_SUSPEND &&
-               pm_runtime_suspended(dev) && ret > 0 &&
-               !dev_pm_test_driver_flags(dev, DPM_FLAG_NEVER_SKIP);
+       dev->power.direct_complete = ret > 0 && state.event == PM_EVENT_SUSPEND;
        spin_unlock_irq(&dev->power.lock);
        return 0;
 }

So, what to do with this information / potential fix?
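
To see exactly which cases the patch above changes, the two expressions from the diff can be compared over all input combinations. The Python sketch below is a toy model, not kernel code. The function and parameter names are hypothetical stand-ins: prepare_ret for ret, is_suspend_event for state.event == PM_EVENT_SUSPEND, runtime_suspended for pm_runtime_suspended(dev), and never_skip for the DPM_FLAG_NEVER_SKIP test. The "reverted" form is the "+" line of the diff and the "upstream" form is the "-" lines.

```python
from itertools import product


def direct_complete_reverted(prepare_ret: int, is_suspend_event: bool) -> bool:
    """Model of the '+' line: ret > 0 && state.event == PM_EVENT_SUSPEND."""
    return bool(prepare_ret > 0 and is_suspend_event)


def direct_complete_upstream(prepare_ret: int, is_suspend_event: bool,
                             runtime_suspended: bool, never_skip: bool) -> bool:
    """Model of the '-' lines, which add two further conditions."""
    return bool(is_suspend_event and runtime_suspended
                and prepare_ret > 0 and not never_skip)


def differing_inputs() -> list:
    """List every input combination where the two models disagree."""
    rows = []
    for ret, susp, rs, ns in product((0, 1), (False, True),
                                     (False, True), (False, True)):
        reverted = direct_complete_reverted(ret, susp)
        upstream = direct_complete_upstream(ret, susp, rs, ns)
        if reverted != upstream:
            rows.append((ret, susp, rs, ns, reverted, upstream))
    return rows


if __name__ == "__main__":
    for row in differing_inputs():
        print(row)
```

In this model the two forms disagree only when ->prepare returned a positive value during a suspend transition and the device is either not runtime-suspended or has the never-skip flag set. In those cases the reverted form sets direct_complete and the upstream form does not. The model does not explain why that matters for this particular hardware.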
Comment 16 Alex Deucher 2018-05-11 15:18:46 UTC
(In reply to Thomas Martitz from comment #15)
> 
> So, what to do with this information / potential fix?

Please file a bug as per comment 10 and include that information.
Comment 17 Thomas Martitz 2018-05-11 15:24:26 UTC
Done, https://bugzilla.kernel.org/show_bug.cgi?id=199693
Comment 18 richard 2019-07-15 19:00:28 UTC
Hello,

I don't know if this is the correct place to report this, but I have a desktop PC running Ubuntu 16.04 (Kubuntu, to be precise), and I also noticed that the system won't resume from suspend after installing the amdgpu driver for the RX 560.

This happens with the AMD driver for 16.04 Xenial:

https://www.amd.com/en/support/kb/release-notes/rn-prorad-lin-amdgpupro

and also with the AMDGPU-PRO driver for 18.04:

https://www.amd.com/pl/support/1881

Kernel:

4.10.0-28-generic #32~16.04.2-Ubuntu SMP Thu Jul 20 10:19:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

The system is:

Ryzen 5 2600
AMD RX560 2GB
16 GB RAM

Without the AMDGPU driver, suspend seems to work without problems.
Comment 19 Martin Peres 2019-11-19 08:38:12 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/380.

