Bug 92836

Summary: amdgpu does not resume properly from suspend
Product: DRI
Reporter: David Walker <David>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED INVALID
QA Contact:
Severity: normal
Priority: medium
CC: tiwai, vedran
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                                   Flags
dmesg output from 4.3.0-1.g7b374a4-default    none
Xorg.0.log from 4.3.0-1.g7b374a4-default      none
suspend1.dmesg from 4.3.0-6.g6b3b033-default  none
suspend2.dmesg from 4.3.0-6.g6b3b033-default  none
suspend3.dmesg from 4.3.0-6.g6b3b033-default  none
Xorg.0.log from 4.3.0-6.g6b3b033-default      none

Description David Walker 2015-11-05 21:27:43 UTC
My laptop does not resume properly after a suspend.  The problem seems to be with the amdgpu kernel module; a failed resume is often accompanied by a long stream of errors in dmesg.  Here's a sampling:

 [ 1494.980561] amdgpu 0000:00:01.0: GPU fault detected: 146 0x0b020504
 [ 1494.980561] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
 [ 1494.980561] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
 [ 1494.980561] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
 [ 1494.995478] systemd-journald[498]: /dev/kmsg buffer overrun, some messages lost.
 [ 1494.980561] amdgpu 0000:00:01.0: GPU fault detected: 146 0x0b0a4004
 [ 1494.980561] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
 [ 1494.980561] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
 [ 1494.980561] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
 [ 1494.995486] systemd-journald[498]: /dev/kmsg buffer overrun, some messages lost.
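
When the fault stream is long, it can help to collapse it into a count per distinct fault value. The following is a minimal sketch and not part of the original report; the function name `summarize_gpu_faults` and the regular expression are assumptions based only on the message format visible in the sample above.

```python
import re
from collections import Counter

# Assumed message format, taken from the sample above:
# "[ 1494.980561] amdgpu 0000:00:01.0: GPU fault detected: 146 0x0b020504"
FAULT_RE = re.compile(
    r"amdgpu (?P<dev>[0-9a-f:.]+): GPU fault detected: "
    r"(?P<src>\d+) (?P<data>0x[0-9a-f]+)"
)

def summarize_gpu_faults(lines):
    """Count 'GPU fault detected' lines per (PCI device, fault data) pair."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[(match.group("dev"), match.group("data"))] += 1
    return counts

# Three lines copied from the sample above; only two are fault headers.
sample = [
    "[ 1494.980561] amdgpu 0000:00:01.0: GPU fault detected: 146 0x0b020504",
    "[ 1494.980561] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000",
    "[ 1494.980561] amdgpu 0000:00:01.0: GPU fault detected: 146 0x0b0a4004",
]

for (dev, data), count in summarize_gpu_faults(sample).items():
    print(dev, data, count)
```

The same function could be fed the lines of a saved dmesg capture instead of the inline sample.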

Any ideas?  I'm running kernel 4.2.4-1-default under openSUSE Tumbleweed.  Here's some hwinfo data:

 09: PCI 01.0: 0300 VGA compatible controller (VGA)              
  [Created at pci.366]
  Unique ID: vSkL.bMI5Iw7ysWD
  SysFS ID: /devices/pci0000:00/0000:00:01.0
  SysFS BusID: 0000:00:01.0
  Hardware Class: graphics card
  Model: "ATI Carrizo"
  Vendor: pci 0x1002 "ATI Technologies Inc"
  Device: pci 0x9874 "Carrizo"
  SubVendor: pci 0x103c "Hewlett-Packard Company"
  SubDevice: pci 0x80af 
  Revision: 0xc5
  Driver: "amdgpu"
  Driver Modules: "drm"
  Memory Range: 0xe0000000-0xefffffff (ro,non-prefetchable)
  Memory Range: 0xf0000000-0xf07fffff (ro,non-prefetchable)
  I/O Ports: 0xf000-0xf0ff (rw)
  Memory Range: 0xff700000-0xff73ffff (rw,non-prefetchable)
  Memory Range: 0xff740000-0xff75ffff (ro,non-prefetchable,disabled)
  IRQ: 47 (129945 events)
  Module Alias: "pci:v00001002d00009874sv0000103Csd000080AFbc03sc00i00"
  Driver Info #0:
    Driver Status: amdgpu is active
    Driver Activation Cmd: "modprobe amdgpu"
  Config Status: cfg=new, avail=yes, need=no, active=unknown
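
The "Module Alias" string in the hwinfo output packs the same vendor, device, and subsystem IDs into the kernel's PCI modalias format. As a rough illustration (not part of the original report), the sketch below decodes that string; the helper name `decode_pci_modalias` is hypothetical.

```python
import re

# PCI modalias layout: vendor, device, subsystem vendor, subsystem device
# (8 hex digits each), then base class, subclass and programming interface
# (2 hex digits each).
MODALIAS_RE = re.compile(
    r"pci:v(?P<vendor>[0-9A-F]{8})d(?P<device>[0-9A-F]{8})"
    r"sv(?P<subvendor>[0-9A-F]{8})sd(?P<subdevice>[0-9A-F]{8})"
    r"bc(?P<base_class>[0-9A-F]{2})sc(?P<sub_class>[0-9A-F]{2})"
    r"i(?P<prog_if>[0-9A-F]{2})"
)

def decode_pci_modalias(alias):
    """Split a PCI modalias string into integer fields."""
    match = MODALIAS_RE.fullmatch(alias)
    if match is None:
        raise ValueError(f"not a PCI modalias: {alias!r}")
    return {name: int(value, 16) for name, value in match.groupdict().items()}

# String copied from the hwinfo output above.
fields = decode_pci_modalias(
    "pci:v00001002d00009874sv0000103Csd000080AFbc03sc00i00"
)
for name, value in fields.items():
    print(f"{name}: {value:#06x}")
```

The decoded vendor, device, and subsystem values should match the `Vendor`, `Device`, `SubVendor`, and `SubDevice` lines reported by hwinfo.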
Comment 1 Alex Deucher 2015-11-05 22:34:48 UTC
Can you try kernel 4.3?
Comment 2 Alex Deucher 2015-11-05 22:35:08 UTC
Please attach your xorg log and dmesg output.
Comment 3 David Walker 2015-11-06 19:27:01 UTC
Created attachment 119452
dmesg output from 4.3.0-1.g7b374a4-default
Comment 4 David Walker 2015-11-06 19:27:40 UTC
Created attachment 119453
Xorg.0.log from 4.3.0-1.g7b374a4-default
Comment 5 David Walker 2015-11-06 19:32:05 UTC
I've attached dmesg and Xorg.0.log files for 4.3.0-1.g7b374a4.  It appears that the GPU faults have gone away, but the visual symptom is still the same; the screen is blank after a resume.

You'll also note that there are a *lot* of "xhci_hcd 0000:00:10.0: WARN Successful completion on short TX" messages in the dmesg output.  I suspect they're unrelated, but they don't appear under 4.2.3-1.4.
Comment 6 Alex Deucher 2015-11-06 20:09:04 UTC
Does booting with acpi_osi=Linux on the kernel command line help?  A lot of new laptops use d3cold to support Windows 10, and Linux in general doesn't support d3cold at the moment.
Comment 7 David Walker 2015-11-07 23:57:19 UTC
acpi_osi=Linux doesn't seem to help.  I have found that the display does recover sometimes, albeit rarely: more often when running GNOME on Wayland rather than X11, and sometimes only after Ctrl-Alt-Backspace.  I haven't done all that much testing, though, so this may all simply be coincidence.

Any other debugging I could do?
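
One quick sanity check when testing a boot parameter is to confirm that it actually reached the kernel, which on Linux is visible in `/proc/cmdline`. The sketch below is an illustration only, not something posted in this thread; the helper name `kernel_param_value` and the sample command line are hypothetical, and the parser assumes simple whitespace-separated tokens without quoting.

```python
def kernel_param_value(cmdline, name):
    """Return the value of the first `name` token on a kernel command line.

    Returns '' if the parameter is present without '=value',
    or None if it is absent. Quoted values are not handled.
    """
    for token in cmdline.split():
        key, _, value = token.partition("=")
        if key == name:
            return value
    return None

# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 acpi_osi=Linux quiet"
print(kernel_param_value(example, "acpi_osi"))

# On a running Linux system the real command line can be read like this.
try:
    with open("/proc/cmdline") as handle:
        print(kernel_param_value(handle.read(), "acpi_osi"))
except OSError:
    print("could not read /proc/cmdline")
```

Because the lookup is an exact match on the parameter name, a misspelt name on the command line simply shows up as absent.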
Comment 8 David Walker 2015-11-20 03:03:39 UTC
Created attachment 119962
suspend1.dmesg from 4.3.0-6.g6b3b033-default
Comment 9 David Walker 2015-11-20 03:04:11 UTC
Created attachment 119963
suspend2.dmesg from 4.3.0-6.g6b3b033-default
Comment 10 David Walker 2015-11-20 03:04:47 UTC
Created attachment 119964
suspend3.dmesg from 4.3.0-6.g6b3b033-default
Comment 11 David Walker 2015-11-20 03:05:19 UTC
Created attachment 119965
Xorg.0.log from 4.3.0-6.g6b3b033-default
Comment 12 David Walker 2015-11-20 03:13:53 UTC
Over the past couple of Tumbleweed kernel upgrades (most recently kernel-default-4.3.0-6.1.g6b3b033), I've noticed that resumes succeed sometimes, and that if a resume fails, another one or two suspend/resume cycles will result in a successful resume. FYI, I'm also using ucode-amd-20151109git-35.1 and kernel-firmware-20151109git-35.1.

I have attached Xorg.0.log and the following three "dmesg -c" outputs:

  suspend1.dmesg - before any suspends
  suspend2.dmesg - after two suspends, the first of which failed
  suspend3.dmesg - after one suspend that succeeded
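
To compare captures like these, one option is to strip the timestamps and list the messages that appear in one capture but not in another. This is a minimal sketch under that assumption, not a tool used in the report; the function names are hypothetical, and the sample lines are placeholders modelled on messages quoted earlier in this bug rather than the contents of the attachments.

```python
import re

# Matches a leading dmesg timestamp such as "[ 1494.980561] ".
TIMESTAMP_RE = re.compile(r"^\s*\[\s*\d+\.\d+\]\s*")

def strip_timestamp(line):
    """Remove the leading '[seconds.micros]' prefix and trailing newline."""
    return TIMESTAMP_RE.sub("", line.rstrip("\n"))

def lines_only_in(first, second):
    """Messages present in `first` but not in `second`, in order, deduplicated."""
    in_second = {strip_timestamp(line) for line in second}
    emitted = set()
    result = []
    for line in first:
        message = strip_timestamp(line)
        if message not in in_second and message not in emitted:
            emitted.add(message)
            result.append(message)
    return result

# Placeholder captures; timestamps and contents are illustrative only.
failed = [
    "[ 1494.980561] amdgpu 0000:00:01.0: GPU fault detected: 146 0x0b020504",
    "[ 1495.000000] xhci_hcd 0000:00:10.0: WARN Successful completion on short TX",
]
succeeded = [
    "[ 2000.000000] xhci_hcd 0000:00:10.0: WARN Successful completion on short TX",
]

for message in lines_only_in(failed, succeeded):
    print(message)
```

In practice the two lists would come from reading files such as suspend2.dmesg and suspend3.dmesg line by line.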
Comment 13 Vedran Miletić 2016-05-22 16:53:25 UTC
David, can you try a newer kernel? I get the same errors with Fedora kernel 4.5.4 on

01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XT / Amethyst XT [Radeon R9 380X / R9 M295X] [1002:6938] (rev f1)

but the errors are non-fatal, i.e. there is graphics corruption but no hangs. Restarting X removes graphics corruption.
Comment 14 David Walker 2016-05-22 19:06:20 UTC
(In reply to Vedran Miletić from comment #13)
Sorry, Vedran, but I ended up buying another laptop (this time from a company that specializes in Linux laptops), so I no longer have a testbed for this.
