Bug 111244

Summary: amdgpu kernel 5.2 blank display after resume from suspend
Product: DRI Reporter: cspack
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: medium CC: andrey.grodzovsky, carmen, cousinmarc, freedesktop, haagch, miba_c, nicholas.kazlauskas, rush, samuele.decarli, vicluo96
Version: DRI git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=111122
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
kernel log
none
first bisect log
none
second bisect log
none
Result of git bisect
none
Kernel log displaying issue
none
xf86-video-amdgpu git bisect
none
failed suspend log
none
failed suspend 5.2.14
none
attachment-25341-0.html
none
failed suspend 5.4.0rc2
none

Description cspack 2019-07-29 04:02:30 UTC
Created attachment 144901 [details]
kernel log

Model: Lenovo Ideapad S340 15"
CPU: AMD Ryzen 5 3500U

Starting with kernel 5.2, laptop has a blank display after resuming from suspend. Problem doesn't appear with recent kernels up to 5.1.16. Attached is a kernel log and git bisect logs.
Comment 1 cspack 2019-07-29 04:04:02 UTC
Created attachment 144902 [details]
first bisect log
Comment 2 cspack 2019-07-29 04:04:24 UTC
Created attachment 144903 [details]
second bisect log
Comment 3 Michel Dänzer 2019-07-29 08:37:44 UTC
The fact that you got two different bisection results indicates that the problem might not be 100% reproducible, and that you accidentally marked some affected commits as good. Please test a given commit longer / more often before declaring it "good".
Comment 4 cspack 2019-07-29 09:22:25 UTC
The first bisect pointed to a merge commit so the second was done to bisect within the merged commits.
Comment 5 Michel Dänzer 2019-07-29 09:50:49 UTC
(In reply to cspack from comment #4)
> The first bisect pointed to a merge commit so the second was done to bisect
> within the merged commits.

That doesn't invalidate my previous comment. :) git bisect identifying a merge commit already indicates the same thing by itself. In particular, the fact that it identified a merge commit means that you declared all of its parent commits good.

(There *are* rare cases where a problem is actually introduced by a merge commit itself, but then the second bisection should have either identified the same merge commit again (if you tested it again), or failed, because all the other commits you tested should have been good again.)
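
To put a rough number on why an intermittent failure misleads a bisection: if a bad commit only shows the blank screen with some probability per suspend/resume cycle, a handful of clean cycles is weak evidence that the commit is good. The sketch below is illustrative only. The per-cycle failure probability is a made-up input, not a measurement from this report, and the cycles are assumed to be independent of each other.

```python
import math


def false_good_probability(p_fail: float, cycles: int) -> float:
    """Chance that a bad commit survives every cycle and gets marked "good".

    p_fail is the (assumed) probability that one suspend/resume cycle
    shows the bug on an affected commit.
    """
    return (1.0 - p_fail) ** cycles


def cycles_needed(p_fail: float, max_risk: float) -> int:
    """Smallest number of clean cycles keeping the false-good risk <= max_risk."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < max_risk < 1.0:
        raise ValueError("p_fail and max_risk must be strictly between 0 and 1")
    return math.ceil(math.log(max_risk) / math.log(1.0 - p_fail))
```

For instance, if the bug showed up in only one cycle out of five, `cycles_needed(0.2, 0.05)` asks for 14 clean cycles before a "good" verdict carries less than a 5% risk of being wrong.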
Comment 6 cspack 2019-07-29 18:38:35 UTC
I see your point, and you are correct. It seems the issue is not 100% reproducible. I will redo the bisect and test more thoroughly. Thank you.
Comment 7 Samuele Decarli 2019-08-02 11:01:39 UTC
@cspack I am currently repeating your bisection on similar hardware; however, I have found 27eaa4927dc3be669ed70670241597ac73595caf to be bad. Could you please retest that commit as well?
Comment 8 Samuele Decarli 2019-08-02 13:19:11 UTC
Created attachment 144928 [details]
Result of git bisect

Model: HP EliteBook 745 G5
CPU/GPU: AMD Ryzen 7 PRO 2700U

I completed my bisection and this is the log.
The first bad commit seems to be this one. It's actually a fairly innocent commit, so it's probably causing a bug somewhere else.

df8368be1382b442384507a5147c89978cd60702 is the first bad commit
commit df8368be1382b442384507a5147c89978cd60702
Author: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Date:   Wed Feb 27 12:56:36 2019 -0500

    drm/amdgpu: Bump amdgpu version for per-flip plane tiling updates
    
    To help xf86-video-amdgpu and mesa know DC supports updating the
    tiling attributes for a framebuffer per-flip.
    
    Cc: Michel Dänzer <michel@daenzer.net>
    Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
    Acked-by: Alex Deucher <alexander.deucher@amd.com>
    Reviewed-by: Marek Olšák <marek.olsak@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Comment 9 Samuele Decarli 2019-08-02 13:21:45 UTC
Created attachment 144929 [details]
Kernel log displaying issue

Attached is my compressed kernel log, which shows repeating stack traces.
Comment 10 Samuele Decarli 2019-08-02 13:27:44 UTC
Interestingly, the same commit is blamed for another issue:
https://bugs.freedesktop.org/show_bug.cgi?id=111122
Comment 11 cspack 2019-08-02 18:14:19 UTC
@Samuele Yes, after redoing the bisect I got the same result as you did. Thanks.
Comment 12 cspack 2019-08-03 02:43:40 UTC
One thing to note: the problem doesn't seem to occur for me if a compositor isn't running. In my case, after disabling compton, I could not reproduce the problem.
Comment 13 Samuele Decarli 2019-08-03 08:19:01 UTC
Similar thing for me: disabling composition in Plasma makes suspend/resume work again.
Comment 14 Damian Kaczmarek 2019-08-04 18:02:23 UTC
Another workaround is to switch to the text terminal (Ctrl+Alt+F2) before suspending.

Occurring on Thinkpad T495, Ryzen 3700U, openSUSE Tumbleweed (kernel 5.2.2-1)
Comment 15 Damian Kaczmarek 2019-08-05 18:59:41 UTC
Are there any other known workarounds? Perhaps some kernel options?
Comment 16 cspack 2019-08-07 09:41:24 UTC
This commit in xf86-video-amdgpu seems to be where things break: https://github.com/freedesktop/xorg-xf86-video-amdgpu/commit/a2b32e72fdaff3007a79b84929997d8176c2d512

Adding amdgpu.dc=1 to kernel options seems to fix the issue for me.
Comment 17 cspack 2019-08-07 09:42:37 UTC
Created attachment 144967 [details]
xf86-video-amdgpu git bisect
Comment 18 Michel Dänzer 2019-08-07 09:51:22 UTC
(In reply to cspack from comment #16)
> Adding amdgpu.dc=1 to kernel options seems to fix the issue for me.

Presumably you mean amdgpu.dc=0 ?

Your findings indicate that the kernel driver DC code doesn't handle flipping between buffers with different tiling parameters correctly in some cases.
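
This also suggests how a commit that only bumps a version number can look like the culprit in a kernel bisect: the bump changes nothing in the display code, but userspace reads the version and switches on a new code path. The sketch below illustrates that kind of gating in general terms. The function name and the `(3, 31)` threshold are assumptions made for illustration and are not taken from this report or from the driver sources.

```python
from typing import Tuple

# Hypothetical threshold: the kernel driver version from which userspace
# assumes DC can change tiling attributes on a per-flip basis.
PER_FLIP_TILING_MIN_VERSION: Tuple[int, int] = (3, 31)


def supports_per_flip_tiling(kms_version: Tuple[int, int], dc_enabled: bool) -> bool:
    """Decide, as a userspace driver might, whether to flip between
    buffers that have different tiling parameters.

    kms_version is the (major, minor) version advertised by the kernel driver.
    """
    return dc_enabled and kms_version >= PER_FLIP_TILING_MIN_VERSION
```

Under this model, a kernel just before the bump makes the check return `False` and the new flip path is never exercised, which would explain why the bisect lands on the version bump even though the faulty handling lives elsewhere.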
Comment 19 cspack 2019-08-07 10:22:43 UTC
With amdgpu.dc=0, X doesn't start ((EE) AMDGPU(0): No modes.)
Comment 20 Michel Dänzer 2019-08-07 10:39:25 UTC
(In reply to cspack from comment #19)
> With amdgpu.dc=0, X doesn't start ((EE) AMDGPU(0): No modes.)

Right (I realized the amdgpu kernel driver doesn't support display with your GPU without DC), but amdgpu.dc=1 is the default. It was probably just luck that it worked once, which is why your first bisect attempts failed.
Comment 21 cspack 2019-08-07 13:48:28 UTC
The default is -1 according to the docs and /sys/module/amdgpu/parameters/dc. I assume it should effectively be the same, but it seems to result in different behavior vs. setting it to 1. DC is enabled in both cases (the log shows "Display Core initialized"), but leaving it at the default results in a suspend/resume failure 100% of the time, whereas setting it to 1 results in success most of the time, although it did fail eventually after several reboots. Very strange.
Comment 22 Michel Dänzer 2019-08-07 13:58:31 UTC
(In reply to cspack from comment #21)
> The default is -1 according to the docs and
> /sys/module/amdgpu/parameters/dc.

What I meant is it's enabled by default for you, so amdgpu.dc=1 has no effect.


> I assume it should effectively be the same but it seems to result in different
> behavior vs. setting it to 1.

The different behaviour is just luck, which is why you had trouble bisecting initially, not related to amdgpu.dc=1.
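
For anyone comparing settings, a small helper along these lines can report what the module parameter currently holds. This is a minimal sketch: the sysfs path is the one quoted in comment 21, the value meanings follow the parameter's description (1 = enable, 0 = disable, -1 = auto), and the function names are made up for illustration.

```python
from pathlib import Path

# Path quoted in comment 21; it only exists while the amdgpu module is loaded.
DC_PARAM_PATH = Path("/sys/module/amdgpu/parameters/dc")


def describe_dc(raw: str) -> str:
    """Translate the raw contents of the dc parameter into words."""
    value = int(raw.strip())
    if value == -1:
        return "auto (driver decides per ASIC)"
    if value == 0:
        return "disabled"
    if value == 1:
        return "enabled"
    return f"unexpected value {value}"


def read_dc(path: Path = DC_PARAM_PATH) -> str:
    """Read the parameter from sysfs and describe it."""
    return describe_dc(path.read_text())
```

Note that "auto" only reports the requested setting. Whether DC actually came up is better confirmed from the kernel log, for example the "Display Core initialized" message mentioned in comment 21.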
Comment 23 Samuele Decarli 2019-08-07 21:43:22 UTC
amdgpu.dc=1 had no effect on my machine. On my computer, resume fails quite consistently.

Any idea on what should be done to fix this, or even what is the cause?
Comment 24 miba_c 2019-08-14 12:29:44 UTC
Having the same issue on a ThinkPad T495s (BIOS 1.06) with a Ryzen 7 PRO 3700U, Kernel 5.2.8-arch1-1-ARCH, Mesa 19.1.4-1 and running sway (wayland) as a window manager.

dmesg shows me:
[drm] Fence fallback timer expired on ring sdma0
amdgpu 0000:05:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
[drm:amdgpu_device_ip_late_init_func_handler [amdgpu]] *ERROR* ib ring test failed (-110).

One thing to note is that setting amd_iommu=off as a kernel parameter makes this issue really rare but it'll still sometimes happen, maybe it's also just luck.
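
When comparing logs from several machines, it can help to pull the error text and code out of such lines. The sketch below does that for the `*ERROR* ... (-NNN)` shape shown above. The regular expression is an assumption based only on these sample lines, and the errno table lists just the one code seen here (110 is ETIMEDOUT on Linux).

```python
import re
from typing import NamedTuple, Optional

# Only the code that appears in the log above; Linux errno numbering.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}

# Matches "*ERROR* <message> (-<code>)" as seen in the sample lines.
_PATTERN = re.compile(r"\*ERROR\*\s+(?P<message>.*?)\s*\((?P<code>-\d+)\)")


class DrmError(NamedTuple):
    message: str
    code: int
    name: str


def parse_drm_error(line: str) -> Optional[DrmError]:
    """Return the message and error code of a DRM *ERROR* line, or None."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    name = LINUX_ERRNO_NAMES.get(-code, "unknown")
    return DrmError(match.group("message"), code, name)
```

Both error lines above carry -110, that is, a timeout: the IB ring test did not complete in time after resume.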
Comment 25 Andrey Grodzovsky 2019-08-14 19:14:44 UTC
(In reply to miba_c from comment #24)
> Having the same issue on a ThinkPad T495s (BIOS 1.06) with a Ryzen 7 PRO
> 3700U, Kernel 5.2.8-arch1-1-ARCH, Mesa 19.1.4-1 and running sway (wayland)
> as a window manager.
> 
> dmesg shows me:
> [drm] Fence fallback timer expired on ring sdma0
> amdgpu 0000:05:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test
> failed on gfx (-110).
> [drm:amdgpu_device_ip_late_init_func_handler [amdgpu]] *ERROR* ib ring test
> failed (-110).
> 
> One thing to note is that setting amd_iommu=off as a kernel parameter makes
> this issue really rare but it'll still sometimes happen, maybe it's also
> just luck.

Please attach full log, also it looks log.
Comment 26 miba_c 2019-08-14 21:15:28 UTC
Created attachment 145065 [details]
failed suspend log

Attached full log
Comment 27 miba_c 2019-08-16 19:25:49 UTC
fwiw downgrading to 5.1.16 seems to fix the issue here too
Comment 28 miba_c 2019-09-17 21:32:55 UTC
Created attachment 145406 [details]
failed suspend 5.2.14

Still occasionally happens on 5.2.14. Hard to figure out what's causing this since it seems rather random and only happens once in a while.
Comment 29 Gabriel C 2019-09-19 19:37:28 UTC
Michael,

I see the same on a Ryzen 7 35750H APU + RX 560x Nitro5 Laptop.

reverting https://github.com/freedesktop/xorg-xf86-video-amdgpu/commit/a2b32e72fdaff3007a79b84929997d8176c2d512

fixes the problem for me.

I tested kernels 5.2* and 5.3, and all have the same problem
when suspending from X with that commit. Without the commit,
everything works fine.


(I will test 5.4git once drm-next is merged, but I tested amd-staging-drm-next
some days ago and that didn't work either.)

If you need more information, please let me know.

I can test any kind of patches (kernel/X/mesa) and/or give
you debug info if you tell me what you need.

Best Regards,

Gabriel C
Comment 30 Gabriel C 2019-09-19 19:38:53 UTC
(In reply to Gabriel C from comment #29)
> Michael,
> 
> I see the same on a Ryzen 7 35750H APU + RX 560x Nitro5 Laptop.

 It is 3750H :-)
Comment 31 Gabriel C 2019-09-20 12:05:10 UTC
I have now tested 5.4git at commit 10169-g574cc4539762.

mesa 19.1.7
xorg-server 1.20.5
xf86-video-amdgpu 19.0.1

Suspend to ram seems to work better, 8 of 10 suspends worked.
Suspend to disk is still the same, broken.

With the commit reverted, both suspend to ram and suspend to disk work fine.
Comment 32 miba_c 2019-10-04 13:47:19 UTC
Cannot reproduce it anymore as of 5.4rc1, seems like it's fixed for me!
Comment 33 Andrey Grodzovsky 2019-10-04 13:47:29 UTC
Created attachment 145645 [details]
attachment-25341-0.html

OOO today
Comment 34 Carmen Bianca Bakker 2019-10-22 11:45:57 UTC
Created attachment 145793 [details]
failed suspend 5.4.0rc2

The issue still occurs on 5.4rc2. In these logs, it happens on the second suspend.

Thinkpad X395, Ryzen 3500U.
Comment 35 Martin Peres 2019-11-19 09:38:25 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/883.
