Bug 111244 - amdgpu kernel 5.2 blank display after resume from suspend
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2019-07-29 04:02 UTC by cspack
Modified: 2019-08-16 19:25 UTC

Attachments
kernel log (172.35 KB, text/plain), 2019-07-29 04:02 UTC, cspack
first bisect log (2.86 KB, text/plain), 2019-07-29 04:04 UTC, cspack
second bisect log (1.13 KB, text/plain), 2019-07-29 04:04 UTC, cspack
Result of git bisect (2.70 KB, text/plain), 2019-08-02 13:19 UTC, Samuele Decarli
Kernel log displaying issue (187.23 KB, application/gzip), 2019-08-02 13:21 UTC, Samuele Decarli
xf86-video-amdgpu git bisect (961 bytes, text/plain), 2019-08-07 09:42 UTC, cspack
failed suspend log (195.28 KB, text/plain), 2019-08-14 21:15 UTC, miba_c

Description cspack 2019-07-29 04:02:30 UTC
Created attachment 144901 [details]
kernel log

Model: Lenovo Ideapad S340 15"
CPU: AMD Ryzen 5 3500U

Starting with kernel 5.2, the laptop has a blank display after resuming from suspend. The problem doesn't appear with recent kernels up to 5.1.16. Attached are a kernel log and git bisect logs.
Comment 1 cspack 2019-07-29 04:04:02 UTC
Created attachment 144902 [details]
first bisect log
Comment 2 cspack 2019-07-29 04:04:24 UTC
Created attachment 144903 [details]
second bisect log
Comment 3 Michel Dänzer 2019-07-29 08:37:44 UTC
The fact that you got two different bisection results indicates that the problem might not be 100% reproducible, and that you accidentally marked some affected commits as good. Please test a given commit longer / more often before declaring it "good".
Comment 4 cspack 2019-07-29 09:22:25 UTC
The first bisect pointed to a merge commit so the second was done to bisect within the merged commits.
Comment 5 Michel Dänzer 2019-07-29 09:50:49 UTC
(In reply to cspack from comment #4)
> The first bisect pointed to a merge commit so the second was done to bisect
> within the merged commits.

That doesn't invalidate my previous comment. :) git bisect identifying a merge commit already indicates the same thing by itself. In particular, the fact that it identified a merge commit means that you declared all of its parent commits good.

(There *are* rare cases where a problem is actually introduced by a merge commit itself, but then the second bisection should have either identified the same merge commit again (if you tested it again), or failed, because all the other commits you tested should have been good again.)
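
To put a rough number on "test longer / more often", here is a minimal sketch. It assumes each suspend/resume cycle on an affected commit fails independently with a fixed probability; the function names and the 30% failure rate used in the usage note are purely illustrative and were not measured for this bug.

```python
import math


def false_good_probability(p_fail: float, cycles: int) -> float:
    """Chance that an affected commit passes `cycles` tests without failing once."""
    return (1.0 - p_fail) ** cycles


def cycles_needed(p_fail: float, max_false_good: float) -> int:
    """Smallest number of clean cycles so that the chance of wrongly
    marking an affected commit as good is at most `max_false_good`."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    # Solve (1 - p_fail) ** n <= max_false_good for the smallest integer n.
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_fail))
```

Under these assumptions, `cycles_needed(0.3, 0.05)` gives 9: if the bug only shows up on about 30% of resumes, a commit needs roughly nine clean suspend/resume cycles before marking it good is wrong less than 5% of the time.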
Comment 6 cspack 2019-07-29 18:38:35 UTC
I see your point, and you are correct. It seems the issue is not 100% reproducible. I will redo the bisect and test more thoroughly. Thank you.
Comment 7 Samuele Decarli 2019-08-02 11:01:39 UTC
@cspack I am currently repeating your bisection on similar hardware; however, I have found 27eaa4927dc3be669ed70670241597ac73595caf to be bad. Could you please retest that commit as well?
Comment 8 Samuele Decarli 2019-08-02 13:19:11 UTC
Created attachment 144928 [details]
Result of git bisect

Model: HP EliteBook 745 G5
CPU/GPU: AMD Ryzen 7 PRO 2700U

I completed my bisection and this is the log.
The first bad commit seems to be this one. It's actually a fairly innocent commit, so it's probably causing a bug somewhere else.

df8368be1382b442384507a5147c89978cd60702 is the first bad commit
commit df8368be1382b442384507a5147c89978cd60702
Author: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
Date:   Wed Feb 27 12:56:36 2019 -0500

    drm/amdgpu: Bump amdgpu version for per-flip plane tiling updates
    
    To help xf86-video-amdgpu and mesa know DC supports updating the
    tiling attributes for a framebuffer per-flip.
    
    Cc: Michel Dänzer <michel@daenzer.net>
    Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com>
    Acked-by: Alex Deucher <alexander.deucher@amd.com>
    Reviewed-by: Marek Olšák <marek.olsak@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
Comment 9 Samuele Decarli 2019-08-02 13:21:45 UTC
Created attachment 144929 [details]
Kernel log displaying issue

Compressed here is my kernel log, which shows repeating stack traces.
Comment 10 Samuele Decarli 2019-08-02 13:27:44 UTC
Interestingly, the same commit is blamed for another issue:
https://bugs.freedesktop.org/show_bug.cgi?id=111122
Comment 11 cspack 2019-08-02 18:14:19 UTC
@Samuele Yes, after redoing the bisect I got the same result as you did. Thanks.
Comment 12 cspack 2019-08-03 02:43:40 UTC
One thing to note: the problem doesn't seem to occur for me if a compositor isn't running. In my case, after disabling compton I could not reproduce the problem.
Comment 13 Samuele Decarli 2019-08-03 08:19:01 UTC
Similar thing for me: disabling composition in Plasma makes suspend/resume work again.
Comment 14 Damian Kaczmarek 2019-08-04 18:02:23 UTC
Another workaround is to switch to the text terminal (Ctrl+Alt+F2) before suspending.

Occurring on ThinkPad T495, Ryzen 3700U, openSUSE Tumbleweed (kernel 5.2.2-1)
Comment 15 Damian Kaczmarek 2019-08-05 18:59:41 UTC
Are there any other known workarounds? Perhaps some kernel options?
Comment 16 cspack 2019-08-07 09:41:24 UTC
This commit in xf86-video-amdgpu seems to be where things break: https://github.com/freedesktop/xorg-xf86-video-amdgpu/commit/a2b32e72fdaff3007a79b84929997d8176c2d512

Adding amdgpu.dc=1 to the kernel options seems to fix the issue for me.
Comment 17 cspack 2019-08-07 09:42:37 UTC
Created attachment 144967 [details]
xf86-video-amdgpu git bisect
Comment 18 Michel Dänzer 2019-08-07 09:51:22 UTC
(In reply to cspack from comment #16)
> Adding amdgpu.dc=1 to kernel options seems fix the issue for me.

Presumably you mean amdgpu.dc=0 ?

Your findings indicate that the kernel driver DC code doesn't handle flipping between buffers with different tiling parameters correctly in some cases.
Comment 19 cspack 2019-08-07 10:22:43 UTC
With amdgpu.dc=0, X doesn't start ((EE) AMDGPU(0): No modes.)
Comment 20 Michel Dänzer 2019-08-07 10:39:25 UTC
(In reply to cspack from comment #19)
> With amdgpu.dc=0, X doesn't start ((EE) AMDGPU(0): No modes.)

Right (I realized the amdgpu kernel driver doesn't support display with your GPU without DC), but amdgpu.dc=1 is the default. It was probably just luck that it worked once, which is why your first bisect attempts failed.
Comment 21 cspack 2019-08-07 13:48:28 UTC
The default is -1 according to the docs and /sys/module/amdgpu/parameters/dc. I assume it should effectively be the same, but it seems to result in different behavior vs. setting it to 1. DC is enabled in both cases (the log shows "Display Core initialized"), but leaving it at the default results in a suspend/resume failure 100% of the time, whereas setting it to 1 results in success most of the time, although it did eventually fail after several reboots. Very strange.
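
For anyone else checking the same setting, a minimal sketch of reading the parameter follows. It assumes the documented meaning of amdgpu.dc (1 = enable, 0 = disable, -1 = auto); the helper names are hypothetical. Note that the sysfs file only reports the value the module was loaded with, not whether DC actually ended up enabled, so the "Display Core initialized" line in the kernel log is still the thing to look for.

```python
from pathlib import Path

# Path mentioned above; it only exists while the amdgpu module is loaded.
DC_PARAM_PATH = Path("/sys/module/amdgpu/parameters/dc")


def describe_dc_setting(raw: str) -> str:
    """Translate the raw amdgpu.dc module parameter value into words."""
    value = int(raw.strip())
    if value == -1:
        return "auto (driver decides)"
    if value == 0:
        return "disabled"
    if value == 1:
        return "enabled"
    return f"unrecognized value {value}"


def read_dc_setting(path: Path = DC_PARAM_PATH) -> str:
    """Read and describe the parameter, tolerating a missing module."""
    try:
        return describe_dc_setting(path.read_text())
    except FileNotFoundError:
        return "amdgpu module not loaded"


if __name__ == "__main__":
    print(read_dc_setting())
```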
Comment 22 Michel Dänzer 2019-08-07 13:58:31 UTC
(In reply to cspack from comment #21)
> The default is -1 according to the docs and
> /sys/module/amdgpu/parameters/dc.

What I meant is it's enabled by default for you, so amdgpu.dc=1 has no effect.


> I assume it should effectively be the same but it seems to result in different
> behavior vs. setting it to 1.

The different behaviour is just luck, which is why you had trouble bisecting initially, not related to amdgpu.dc=1.
Comment 23 Samuele Decarli 2019-08-07 21:43:22 UTC
amdgpu.dc=1 had no effect on my machine. On my computer, resume fails quite consistently.

Any idea on what should be done to fix this, or even what is the cause?
Comment 24 miba_c 2019-08-14 12:29:44 UTC
Having the same issue on a ThinkPad T495s (BIOS 1.06) with a Ryzen 7 PRO 3700U, Kernel 5.2.8-arch1-1-ARCH, Mesa 19.1.4-1 and running sway (wayland) as a window manager.

dmesg shows me:
[drm] Fence fallback timer expired on ring sdma0
amdgpu 0000:05:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
[drm:amdgpu_device_ip_late_init_func_handler [amdgpu]] *ERROR* ib ring test failed (-110).

One thing to note is that setting amd_iommu=off as a kernel parameter makes this issue really rare but it'll still sometimes happen, maybe it's also just luck.
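
For reference, the (-110) in those messages is a negated Linux errno, and 110 is ETIMEDOUT on Linux, so the IB ring test timed out rather than returning a wrong result. Below is a minimal sketch for annotating such lines when reading through a long log. The helper name and the lookup table are hypothetical, and the table deliberately lists only the one code seen here.

```python
import re

# Small lookup of Linux errno values; only the code seen in this log is listed.
LINUX_ERRNO_NAMES = {110: "ETIMEDOUT"}

# Matches a kernel-style negated errno in parentheses, e.g. "(-110)".
ERRNO_PATTERN = re.compile(r"\(-(\d+)\)")


def annotate_errno(line: str) -> str:
    """Append the symbolic errno name to codes such as "(-110)" in a log line."""

    def replace(match: re.Match) -> str:
        code = int(match.group(1))
        name = LINUX_ERRNO_NAMES.get(code)
        # Leave unknown codes untouched rather than guessing a name.
        return f"(-{code} {name})" if name else match.group(0)

    return ERRNO_PATTERN.sub(replace, line)
```

Applied to the last dmesg line above, this turns "ib ring test failed (-110)." into "ib ring test failed (-110 ETIMEDOUT)."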
Comment 25 Andrey Grodzovsky 2019-08-14 19:14:44 UTC
(In reply to miba_c from comment #24)
> Having the same issue on a ThinkPad T495s (BIOS 1.06) with a Ryzen 7 PRO
> 3700U, Kernel 5.2.8-arch1-1-ARCH, Mesa 19.1.4-1 and running sway (wayland)
> as a window manager.
> 
> dmesg shows me:
> [drm] Fence fallback timer expired on ring sdma0
> amdgpu 0000:05:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test
> failed on gfx (-110).
> [drm:amdgpu_device_ip_late_init_func_handler [amdgpu]] *ERROR* ib ring test
> failed (-110).
> 
> One thing to note is that setting amd_iommu=off as a kernel parameter makes
> this issue really rare but it'll still sometimes happen, maybe it's also
> just luck.

Please attach full log, also it looks log.
Comment 26 miba_c 2019-08-14 21:15:28 UTC
Created attachment 145065 [details]
failed suspend log

Attached full log
Comment 27 miba_c 2019-08-16 19:25:49 UTC
fwiw downgrading to 5.1.16 seems to fix the issue here too

