92836 – amdgpu does not resume properly from suspend

Status:	RESOLVED INVALID

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	XOrg git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account

Reported:	2015-11-05 21:27 UTC by David Walker
Modified:	2017-10-20 17:22 UTC
CC List:	2 users

Attachments
dmesg output from 4.3.0-1.g7b374a4-default (288.34 KB, text/plain) 2015-11-06 19:27 UTC, David Walker
Xorg.0.log from 4.3.0-1.g7b374a4-default (43.73 KB, text/plain) 2015-11-06 19:27 UTC, David Walker
suspend1.dmesg from 4.3.0-6.g6b3b033-default (86.88 KB, text/plain) 2015-11-20 03:03 UTC, David Walker
suspend2.dmesg from 4.3.0-6.g6b3b033-default (25.06 KB, text/plain) 2015-11-20 03:04 UTC, David Walker
suspend3.dmesg from 4.3.0-6.g6b3b033-default (16.63 KB, text/plain) 2015-11-20 03:04 UTC, David Walker
Xorg.0.log from 4.3.0-6.g6b3b033-default (41.64 KB, text/plain) 2015-11-20 03:05 UTC, David Walker

Description David Walker 2015-11-05 21:27:43 UTC

My laptop does not resume properly after a suspend.  The problem seems to be with the amdgpu kernel module; a failed resume is often accompanied by a long stream of errors in dmesg.  Here's a sample:

 [ 1494.980561] amdgpu 0000:00:01.0: GPU fault detected: 146 0x0b020504
 [ 1494.980561] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
 [ 1494.980561] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
 [ 1494.980561] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
 [ 1494.995478] systemd-journald[498]: /dev/kmsg buffer overrun, some messages lost.
 [ 1494.980561] amdgpu 0000:00:01.0: GPU fault detected: 146 0x0b0a4004
 [ 1494.980561] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
 [ 1494.980561] amdgpu 0000:00:01.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x00000000
 [ 1494.980561] VM fault (0x00, vmid 0) at page 0, read from '' (0x00000000) (0)
 [ 1494.995486] systemd-journald[498]: /dev/kmsg buffer overrun, some messages lost.
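
For a long stream like the one above, it can help to count the distinct faults rather than read them line by line. The following is only a rough sketch: the function name `summarize_gpu_faults` and the field labels `code` and `data` are made up for illustration, and the pattern is based solely on the "GPU fault detected" lines quoted here, so other kernels may print a different format.

```python
import re
from collections import Counter

# Pattern modelled on the quoted lines, e.g.
#   amdgpu 0000:00:01.0: GPU fault detected: 146 0x0b020504
# "code" and "data" are illustrative labels, not official field names.
FAULT_RE = re.compile(
    r"amdgpu (?P<device>\S+): GPU fault detected: "
    r"(?P<code>\d+) (?P<data>0x[0-9a-fA-F]+)"
)


def summarize_gpu_faults(dmesg_text: str) -> Counter:
    """Count (device, code, data) triples found in dmesg output."""
    counts: Counter = Counter()
    for line in dmesg_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            key = (match.group("device"), match.group("code"), match.group("data"))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage (assumed): dmesg | python3 this_script.py
    for (device, code, data), n in summarize_gpu_faults(sys.stdin.read()).most_common():
        print(f"{n:6d}  {device}  {code}  {data}")
```

Because journald reports that some messages were lost, the counts from such a script are a lower bound at best.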

Any ideas?  I'm running kernel 4.2.4-1-default under openSUSE Tumbleweed.  Here's some hwinfo data:

 09: PCI 01.0: 0300 VGA compatible controller (VGA)              
  [Created at pci.366]
  Unique ID: vSkL.bMI5Iw7ysWD
  SysFS ID: /devices/pci0000:00/0000:00:01.0
  SysFS BusID: 0000:00:01.0
  Hardware Class: graphics card
  Model: "ATI Carrizo"
  Vendor: pci 0x1002 "ATI Technologies Inc"
  Device: pci 0x9874 "Carrizo"
  SubVendor: pci 0x103c "Hewlett-Packard Company"
  SubDevice: pci 0x80af 
  Revision: 0xc5
  Driver: "amdgpu"
  Driver Modules: "drm"
  Memory Range: 0xe0000000-0xefffffff (ro,non-prefetchable)
  Memory Range: 0xf0000000-0xf07fffff (ro,non-prefetchable)
  I/O Ports: 0xf000-0xf0ff (rw)
  Memory Range: 0xff700000-0xff73ffff (rw,non-prefetchable)
  Memory Range: 0xff740000-0xff75ffff (ro,non-prefetchable,disabled)
  IRQ: 47 (129945 events)
  Module Alias: "pci:v00001002d00009874sv0000103Csd000080AFbc03sc00i00"
  Driver Info #0:
    Driver Status: amdgpu is active
    Driver Activation Cmd: "modprobe amdgpu"
  Config Status: cfg=new, avail=yes, need=no, active=unknown

Comment 1 Alex Deucher 2015-11-05 22:34:48 UTC

Can you try kernel 4.3?

Comment 2 Alex Deucher 2015-11-05 22:35:08 UTC

Please attach your xorg log and dmesg output.

Comment 3 David Walker 2015-11-06 19:27:01 UTC

Created attachment 119452
dmesg output from 4.3.0-1.g7b374a4-default

Comment 4 David Walker 2015-11-06 19:27:40 UTC

Created attachment 119453
Xorg.0.log from 4.3.0-1.g7b374a4-default

Comment 5 David Walker 2015-11-06 19:32:05 UTC

I've attached dmesg and Xorg.0.log files for 4.3.0-1.g7b374a4.  It appears that the GPU faults have gone away, but the visual symptom is still the same; the screen is blank after a resume.

You'll also note that there are a *lot* of "xhci_hcd 0000:00:10.0: WARN Successful completion on short TX" messages in the dmesg output.  I suspect they're unrelated, but they don't appear under 4.2.3-1.4.

Comment 6 Alex Deucher 2015-11-06 20:09:04 UTC

Does booting with acpi_osi=Linux on the kernel command line help?  A lot of new laptops use d3cold to support Windows 10, and Linux in general doesn't support that at the moment.
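
When testing a boot option like this, it is worth confirming that the running kernel actually received it. The kernel parameter is spelled `acpi_osi`, and a misspelled name is generally not applied by the kernel, so the test would not really be run. A minimal sketch, assuming a Linux system where `/proc/cmdline` is readable; the helper name `kernel_param` is hypothetical:

```python
from typing import Optional


def kernel_param(cmdline: str, name: str) -> Optional[str]:
    """Return the value of `name` on a kernel command line.

    Returns "" for a bare flag without "=", and None if the
    parameter is absent (for example because it was misspelled).
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            return value.strip('"') if sep else ""
    return None


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()


if __name__ == "__main__":
    value = kernel_param(read_cmdline(), "acpi_osi")
    if value is None:
        print("acpi_osi is not set on this boot")
    else:
        print(f"acpi_osi={value}")
```

The sketch splits on whitespace only, so a quoted value containing spaces would not be handled.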

Comment 7 David Walker 2015-11-07 23:57:19 UTC

Booting with acpi_osi=Linux doesn't seem to help.  I have found that the machine does sometimes recover, albeit rarely: more often when running GNOME on Wayland than on X11, and sometimes only after Ctrl-Alt-Backspace.  I haven't done much testing, though, so this may all be coincidence.

Any other debugging I could do?

Comment 8 David Walker 2015-11-20 03:03:39 UTC

Created attachment 119962
suspend1.dmesg from 4.3.0-6.g6b3b033-default

Comment 9 David Walker 2015-11-20 03:04:11 UTC

Created attachment 119963
suspend2.dmesg from 4.3.0-6.g6b3b033-default

Comment 10 David Walker 2015-11-20 03:04:47 UTC

Created attachment 119964
suspend3.dmesg from 4.3.0-6.g6b3b033-default

Comment 11 David Walker 2015-11-20 03:05:19 UTC

Created attachment 119965
Xorg.0.log from 4.3.0-6.g6b3b033-default

Comment 12 David Walker 2015-11-20 03:13:53 UTC

Over the past couple of Tumbleweed kernel upgrades (most recently kernel-default-4.3.0-6.1.g6b3b033), I've noticed that resumes succeed sometimes, and that if a resume fails, another one or two suspend/resume cycles will result in a successful resume. FYI, I'm also using ucode-amd-20151109git-35.1 and kernel-firmware-20151109git-35.1.

I have attached Xorg.0.log and the following three "dmesg -c" outputs:

  suspend1.dmesg - before any suspends
  suspend2.dmesg - after two suspends, the first of which failed
  suspend3.dmesg - after one suspend that succeeded
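
For anyone who wants to collect a similar series of captures, the per-cycle "dmesg -c" step could be scripted roughly as follows. This is a sketch, not the procedure actually used for the attachments: the helper names are hypothetical, `dmesg -c` normally needs root, and the `suspendN.dmesg` naming simply mirrors the files listed above.

```python
import subprocess
from pathlib import Path


def read_and_clear_dmesg() -> str:
    """Return the kernel ring buffer, then clear it (`dmesg -c`, needs root)."""
    result = subprocess.run(
        ["dmesg", "-c"], capture_output=True, text=True, check=True
    )
    return result.stdout


def save_capture(index: int, directory: Path, reader=read_and_clear_dmesg) -> Path:
    """Write one capture to suspend<index>.dmesg and return its path.

    `reader` is injectable so the function can be tried without root.
    """
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"suspend{index}.dmesg"
    path.write_text(reader())
    return path
```

Calling `save_capture(1, Path("."))` before the first suspend and incrementing the index after each resume yields one file per interval, because each read clears the buffer.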

Comment 13 Vedran Miletić 2016-05-22 16:53:25 UTC

David, can you try a newer kernel? I get the same errors with Fedora kernel 4.5.4 on

01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tonga XT / Amethyst XT [Radeon R9 380X / R9 M295X] [1002:6938] (rev f1)

but the errors are non-fatal, i.e. there is graphics corruption but no hangs. Restarting X removes graphics corruption.

Comment 14 David Walker 2016-05-22 19:06:20 UTC

(In reply to Vedran Miletić from comment #13)
> David, can you try a newer kernel? I get the same errors with Fedora kernel
> 4.5.4 on
> 
> 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc.
> [AMD/ATI] Tonga XT / Amethyst XT [Radeon R9 380X / R9 M295X] [1002:6938]
> (rev f1)
> 
> but the errors are non-fatal, i.e. there is graphics corruption but no
> hangs. Restarting X removes graphics corruption.

Sorry, Vedran, but I ended up buying another laptop (this time from a company that specializes in Linux laptops), so I no longer have a testbed for this.
