Bug 103277 - [bisected] Systems hangs on resume from S3 sleep due to "Match actual state during S3 resume" commit
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium blocker
Assignee: Default DRI bug account
 
Reported: 2017-10-14 16:06 UTC by dwagner
Modified: 2018-12-10 21:10 UTC

Attachments
dmesg of system before it fails to resume from S3 sleep (68.41 KB, text/plain)
2017-10-14 16:06 UTC, dwagner
kernel .config as used with git at commit 61deb7d0dddd941d1e3ffee0d799396ac93b0e90 (HEAD, origin/drm-next-4.17-wip) (179.46 KB, text/plain)
2018-03-17 11:18 UTC, dwagner

Description dwagner 2017-10-14 16:06:04 UTC
After I updated my kernel to https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next as of today (latest commit at this time: 1c630e83443a0f271c192ecfa0d94023661a0000) I noticed that my computer would no longer wake up from S3 sleep - power LED goes on, but apart from that no display and no reaction to any input (other than Num-Lock still switchable).

The difference from the ~2 weeks older kernel was 100% reproducible by just doing:

- boot to console (no start of X11 required) and login as root
- echo "mem" >/sys/power/state
- wait till power LED blinks
- press some key to wake up the system
- observe no display output and no reaction to anything
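
The suspend step above can be sketched as a small script. This is only a minimal sketch, not part of the original report: the `suspend_to_ram` helper and the `DRY_RUN` and `STATE_FILE` variables are hypothetical names introduced for illustration. Only the write of `mem` to `/sys/power/state` comes from the steps above. A real suspend needs root; by default the sketch only prints what it would do.

```shell
#!/bin/sh
# Hypothetical helper: request S3 (suspend-to-RAM) through sysfs.
# STATE_FILE and DRY_RUN are illustration-only knobs, not part of the report.
STATE_FILE="${STATE_FILE:-/sys/power/state}"
DRY_RUN="${DRY_RUN:-1}"   # safe default: print the action, do not suspend

suspend_to_ram() {
    if [ "$DRY_RUN" = "1" ]; then
        # Only show the command; do not touch the power state.
        echo "would run: echo mem > $STATE_FILE"
        return 0
    fi
    # Fail early if the kernel does not offer the "mem" state.
    grep -qw mem "$STATE_FILE" || { echo "mem not supported" >&2; return 1; }
    echo mem > "$STATE_FILE"
}

suspend_to_ram
```

Running it as root with `DRY_RUN=0` would perform the actual suspend; waking the machine with a key press and checking for display output remain manual steps.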

GPU is a RX 460, CPU a Ryzen R7 1800X, board Asus X370 Pro, connected is one 4k display via HDMI 2.0.

I bisected the git commits, and the one that causes the bug is:

>  7ae4acd21e9e264afb079e23d43bcf2238c7dbea
>  drm/amd/display: Match actual state during S3 resume.

After this, I went back to the current HEAD of amd-staging-drm-next and (manually) reverted only commit 7ae4acd21e9e264afb079e23d43bcf2238c7dbea, and this indeed results in a kernel that works fine with regards to resuming from S3 sleep.
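
The bisect-then-revert workflow described above can be tried out on a throwaway repository. The sketch below is an assumption-laden demo, not the actual kernel bisect: the repository, the `BROKEN` marker file, and the automated test command are all made up for illustration. For a kernel, each bisect step would instead be a manual build, boot, and suspend/resume test, marked with `git bisect good` or `git bisect bad`.

```shell
#!/bin/sh
# Throwaway demo: find a "regression" with git bisect, then revert only it.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name "Demo User"

# Five commits, each adding its own file; commit 3 also adds BROKEN,
# which stands in for the change that breaks resume.
for i in 1 2 3 4 5; do
    echo "$i" > "file_$i"
    if [ "$i" -eq 3 ]; then touch BROKEN; fi
    git add -A
    git commit -q -m "commit $i"
done

good=$(git rev-list --max-parents=0 HEAD)   # oldest commit is known good
git bisect start HEAD "$good" >/dev/null    # HEAD is known bad

# Exit status 0 marks a commit good, 1 marks it bad.
git bisect run sh -c 'test ! -e BROKEN' >/dev/null

first_bad=$(git rev-parse refs/bisect/bad)  # ref left behind by bisect
git bisect reset >/dev/null 2>&1

echo "first bad commit: $(git log -1 --format=%s "$first_bad")"
git revert --no-edit "$first_bad" >/dev/null   # back out only that commit
```

Reverting a single commit on top of the current HEAD, as done in the last line, mirrors the manual revert of 7ae4acd21e9e264afb079e23d43bcf2238c7dbea described above.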

There are no noteworthy dmesg messages when going to S3 sleep, but the dmesg output that I can collect ends slightly before the system actually sleeps, so I have no way to know whether anything is emitted during the wake-up attempt.
Comment 1 dwagner 2017-10-14 16:06:56 UTC
Created attachment 134842
dmesg of system before it fails to resume from S3 sleep
Comment 2 Jordan L 2017-10-15 17:24:43 UTC
(In reply to dwagner from comment #0)
> After I updated my kernel to
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next as of
> today (latest commit at this time: 
> 1c630e83443a0f271c192ecfa0d94023661a0000) I noticed that my computer would
> no longer wake up from S3 sleep - power LED goes on, but apart from that no
> display and no reaction to any input (other than Num-Lock still switchable).
> 
> The difference to the ~2 weeks older kernel was 100% reproducible by just
> doing:
> 
> - boot to console (no start of X11 required) and login as root
> - echo "mem" >/sys/power/state
> - wait till power LED blinks
> - press some key to wake up the system
> - observe no display output and no reaction to anything
> 
> GPU is a RX 460, CPU a Ryzen R7 1800X, board Asus X370 Pro, connected is one
> 4k display via HDMI 2.0.
> 
> I bisected the git commits, and the one that causes the bug is:
> 
> >  7ae4acd21e9e264afb079e23d43bcf2238c7dbea
> >  drm/amd/display: Match actual state during S3 resume.
> 
> After this, I went back to the current HEAD of amd-staging-drm-next and
> (manually) reverted only commit 7ae4acd21e9e264afb079e23d43bcf2238c7dbea,
> and this indeed results in a kernel that works fine with regards to resuming
> from S3 sleep.
> 
> There are no noteworthy "dmesg"-emissions that accompany going to S3 sleep,
> but the dmesg-output that I can collect ends slightly before the system
> actually sleeps, so whether anything is emitted upon the wake-up attempt, I
> have no way to know.

Thanks, we can reproduce this too, should have something shortly.
Comment 3 dwagner 2017-11-11 18:41:06 UTC
(In reply to Jordan L from comment #2)
> Thanks, we can reproduce this too, should have something shortly.

Any news on this?

I am asking, because the current head of https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next (as of commit d66bebab317cc56a2a6d2438fcd89870c3d172ca) still has this very bug - and that is even though within "git log" I can see in commit 26c860d5579684528114c3875ef88f7796330eb5 there was
> Revert "drm/amd/display: Match actual state during S3 resume."
so I currently know of no way to make the current amd-staging-drm-next resume from S3 sleep.
Comment 4 Jerry Zuo 2017-12-11 21:55:42 UTC
(In reply to dwagner from comment #3)
> (In reply to Jordan L from comment #2)
> > Thanks, we can reproduce this too, should have something shortly.
> 
> Any news on this?
> 
> I am asking, because the current head of
> https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next (as of
> commit d66bebab317cc56a2a6d2438fcd89870c3d172ca) still has this very bug -
> and that is even though within "git log" I can see in commit
> 26c860d5579684528114c3875ef88f7796330eb5 there was
> > Revert "drm/amd/display: Match actual state during S3 resume."
> so I currently know of no way to make the current amd-staging-drm-next
> resume from S3 sleep.

S3 has been working since commit 95539e2be57.

The test was performed under the following conditions:
Commit: c6f284d9888
Asic: Baffin
4K display: Acer H277HK
S3 passed in 10 runs

Please verify at your side as well. Thanks.
Comment 5 dwagner 2017-12-12 23:10:07 UTC
(In reply to Jerry Zuo from comment #4)
> S3 starts working since commit 95539e2be57. 
> 
> Test is performed on the following condition:
> Commit: c6f284d9888
> Asic: Baffin
> 4K display: Acer H277HK
> S3 passed in 10 runs
> 
> Please verify at your side as well. Thanks.

I would love to verify this - but where can I find commit 95539e2be57 ?

It is not included in any branch of the https://cgit.freedesktop.org/~agd5f/linux/ repository as of this moment - can you point me to a repository that includes it?
Comment 8 dwagner 2017-12-13 00:09:05 UTC
Thanks, Alex.
Just in order to be able to help myself next time: How does one get from knowing the string "95539e2be57" to the commits you linked?

(Will try amd-staging-drm-next right now and follow up with the result after rebooting.)
Comment 9 dwagner 2017-12-13 00:22:58 UTC
Bad news: Tried amd-staging-drm-next as of commit 367a3d2bdc27fd1d23be9ea75cec34b52297184d, which does include the commit https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=c2a899da6d8c0658c6f8493cb6b5ca4e890a15b7 referenced by Alex above, but the result is still the same as in the original description of this bug report.

I tested "echo "mem" >/sys/power/state" both without starting X and from an Xterm. In both cases no picture comes up at resume from S3, and the computer does not react to input.

The test scenario Jerry Zuo described above does not mention whether the display was connected via DP or HDMI - could this be of relevance for reproduction?
Comment 10 Jerry Zuo 2017-12-13 16:11:12 UTC
(In reply to dwagner from comment #9)
> Bad news: Tried amd-staging-drm-next as of commit
> 367a3d2bdc27fd1d23be9ea75cec34b52297184d, which does include the commit
> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-
> next&id=c2a899da6d8c0658c6f8493cb6b5ca4e890a15b7 referenced by Alex above,
> but the result is still the same as in the original description of this bug
> report.
> 
> I tested "echo "mem" >/sys/power/state" both without starting X and from an
> Xterm, in both cases no picture comes up at resume from S3, and the computer
> does not react to input.
> 
> The test scenario Jerry Zuo described above does not mention whether the
> display was connected via DP or HDMI - could this be of relevance for
> reproduction?

I tested on HDMI in 4K display. I'll check the commit on drm-next.
Comment 11 dwagner 2018-01-20 13:38:05 UTC
Just for your information: S3 resume still fails in 100% of attempts, also with the current git version of https://cgit.freedesktop.org/~agd5f/linux/log/?h=amd-staging-drm-next

This is really frustrating. 3 months after this bug first appeared (and by that time was revertable by backing out the commit that caused it), all contemporary kernel versions, both from git and mainline, fail to resume from S3 all the time.

I guess I will have to buy a different GPU :-(
Comment 12 Harry Wentland 2018-01-22 15:06:25 UTC
Can you try blacklisting amdgpu and try S3 again?

We've seen issues with S3 on 4.15 RCs outside of amdgpu where the system wouldn't come back from S3. It's fixed in more recent RCs (definitely 4.15-rc6 and newer) but the issue might still be there in amd-staging-drm-next.
Comment 13 dwagner 2018-01-22 23:08:00 UTC
(In reply to Harry Wentland from comment #12)
> Can you try blacklisting amdgpu and try S3 again?

(Assuming you mean adding "module_blacklist=amdgpu" on the kernel command line:)

Sure, here is what happens:

kernel old = amd-staging-drm-next as of early October 2017
kernel new = amd-staging-drm-next as of this week

          amdgpu
kernel  blacklisted  symptom after S3 resume
--------------------------------------------------------
old         no       resumes fine every time
old        yes       no HDMI signal, but system runs otherwise
new         no       no HDMI signal, system crashes every time
new        yes       no HDMI signal, but system runs otherwise
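
Whether a given boot really had amdgpu blacklisted can be double-checked from the kernel command line. The following is a minimal sketch under stated assumptions: `cmdline_has` is a hypothetical helper, and it only matches the exact parameter string, so a combined list such as `module_blacklist=amdgpu,othermod` would not be detected.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the exact parameter $1 appears as a word in
# the kernel command line given as $2 (defaults to the running kernel's).
cmdline_has() {
    cmdline=${2:-$(cat /proc/cmdline)}
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

if cmdline_has module_blacklist=amdgpu; then
    echo "amdgpu is blacklisted for this boot"
else
    echo "amdgpu is not blacklisted for this boot"
fi
```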

> We've seen issues with S3 on 4.15 RCs outside of amdgpu where the system
> wouldn't come back from S3. It's fixed in more recent RCs (definitely
> 4.15-rc6 and newer) but the issue might still be there in
> amd-staging-drm-next.

Hmmm. Would "drm-next-4.16" include this fix?
Comment 14 Alex Deucher 2018-01-23 04:25:42 UTC
The regression was caused by:

commit ca37e57bbe0cf1455ea3e84eb89ed04a132d59e1 (refs/bisect/bad)
Author: Andy Lutomirski <luto@kernel.org>
Date:   Wed Nov 22 20:39:16 2017 -0800

    x86/entry/64: Add missing irqflags tracing to native_load_gs_index()

I'm not sure what commit fixed it off hand.
Comment 15 Alex Deucher 2018-01-23 04:28:39 UTC
It was supposedly fixed in -rc6 so my drm-next branches don't have the fix yet.
Comment 16 dwagner 2018-02-20 23:53:11 UTC
I noticed in the git log of amd-staging-drm-next that multiple patches
were committed that might be related to S3 resumes, so I retried whether
a kernel compiled from the current amd-staging-drm-next head is able to
resume from S3.

Unfortunately, the symptoms are unchanged: System crashes upon every S3 resume
attempt - so I'm back to the kernel from last October that resumes fine.

With relevant security issues having been addressed in the kernel
between October 2017 and now, this situation is becoming unbearable for me.

Patches that were included in the kernel I tried today:

> commit f8c80313f7a6a8f66b74b118b0e3e5112718e2e5 (HEAD, origin/amd-staging-drm-next)
> Author: Alex Deucher <alexander.deucher@amd.com>
> Date:   Thu Feb 15 08:40:30 2018 -0500
>   Revert "drm/radeon/pm: autoswitch power state when in balanced mode"
>   This reverts commit 1c331f75aa6ccbf64ebcc5a019183e617c9d818a.
>   Breaks resume on some systems.

> commit 734b7ebc0e16b0fb4d2937cc3716c505d7e2c319
> Author: Hersen Wu <hersenxs.wu@amd.com>
> Date:   Tue Jan 30 11:46:16 2018 -0500
> 
>   drm/amd/display: VGA black screen from s3 when attached to hook    
>    [Description] For MST, DC already notify MST sink for MST mode, DC stll
>    check DP SINK DPCD register to see if MST enabled. DP RX firmware may
>    not handle this properly.

> commit adf1c840a6741f1b53ecd6e466e160c725a80641
> Author: Yongqiang Sun <yongqiang.sun@amd.com>
> Date:   Fri Feb 2 17:35:00 2018 -0500
>   drm/amd/display: Keep eDP stream enabled during boot.
>   
>   This path fixed specific eDP panel cold boot black screen
>   due to unnecessary enable link.
>   Change:
>   In case of boot up with eDP, if OS is going to set mode
>   on eDP, keep eDP light up, do not disable and reset corresponding
>   HW.
>   This change may affect dce asics and S3/S4 Resume with multi-monitor.

> commit 8dd8b6bb22fb2470af4e8743f19eabba8127d566
> Author: Charlene Liu <charlene.liu@amd.com>
> Date:   Wed Jan 24 13:18:57 2018 -0500
> 
>  drm/amd/display: resume from S3 bypass power down HW block.

> commit 8d0de6a585e2186734748be2a1043eb3456ed8ed
> Author: Mikita Lipski <mikita.lipski@amd.com>
> Date:   Sat Feb 3 15:19:20 2018 -0500
>   drm/amdgpu: Unify the dm resume calls into one
>  
>   amdgpu_dm_display_resume is now called from dm_resume to
>   unify DAL resume call into a single function call
>   
>   There is no more need to separately call 2 resume functions
>   for DM.
>   
>   Initially they were separated to resume display state after
>   cursor is pinned. But because there is no longer any corruption
>   with the cursor - the calls can be merged into one function hook.
Comment 17 Alex Deucher 2018-02-21 02:12:47 UTC
amd-staging-drm-next is still based on 4.15-rc4 which still has the regression mentioned in comment 14.  Can you try 4.15 final or my drm-next-4.17-wip branch?
Comment 18 dwagner 2018-02-21 21:42:38 UTC
(In reply to Alex Deucher from comment #17)
> amd-staging-drm-next is still based on 4.15-rc4 which still has the
> regression mentioned in comment 14.  Can you try 4.15 final or my
> drm-next-4.17-wip branch?

Just did this - but no change of symptoms, still crash upon every S3 resume attempt.
Comment 19 Mikita Lipski 2018-03-15 15:07:57 UTC
Hi dwagner,

I have attempted to reproduce the issue, but didn't succeed.
We used an identical hardware setup with a reference RX 460 card and couldn't reproduce a hang on various kernel versions.

There is still a possibility that some kernel config options might be causing the issue.
If possible, could you please provide your kernel configuration so that the testing environment is identical?

Also, does the issue reproduce if you disable DC? (amdgpu.dc=0 in GRUB menu)

Thanks 

Nik
Comment 20 dwagner 2018-03-17 11:13:41 UTC
(In reply to mikita.lipski@amd.com from comment #19)
> I have attempted to reproduce an issue, but didn't succeed.
> We have used an identical hardware setup with a reference rx460 card and
> couldn't reproduce a hang on various kernel versions.

Strange. If I didn't keep an older amd-staging-drm-next kernel from October 2017 in use where S3 resumes work every time, I would doubt my hardware - but as it is, the only possible cause is a kernel change since then.

> There is still a possibility that some kernel config options might be
> causing an issue. 
> If it's possible, could you please provide kernel configuration so the
> testing environment is identical.

Sure, will attach my .config (as used with the current drm-next-4.17-wip) after this comment.

> Also does the issue reproduce if you disable DC? (amdgpu.dc=0 in
> GRUB menu)

Tested with today's drm-next-4.17-wip git: With "amdgpu.dc=0" on the kernel command line, resume from S3 works fine for me, both from the console (before starting X11) and when running X11.
(But of course, no 4k/60Hz video and no HDMI audio with amdgpu.dc=0, so that is not really an option for actual use.)

So the resume-from-S3 failures are definitely caused by something in the DC code that changed since early October 2017.
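
To confirm which DC setting a running kernel actually uses, the amdgpu module parameter can be read back from sysfs. This is a hedged sketch: `describe_dc` is a hypothetical helper, and it assumes the parameter is exposed at `/sys/module/amdgpu/parameters/dc` with the values 0 (disabled), 1 (enabled) and -1 (auto), which may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: describe the amdgpu "dc" module parameter.
# The path argument is optional and exists only so the sketch can be tried
# on a machine without the amdgpu module.
describe_dc() {
    file=${1:-/sys/module/amdgpu/parameters/dc}
    [ -r "$file" ] || { echo "amdgpu not loaded (no $file)"; return 1; }
    case "$(cat "$file")" in
        0)  echo "DC disabled" ;;
        1)  echo "DC enabled" ;;
        -1) echo "DC auto (driver default)" ;;
        *)  echo "unknown dc value" ;;
    esac
}

describe_dc || true
```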
Comment 21 dwagner 2018-03-17 11:18:52 UTC
Created attachment 138169
kernel .config as used with git at commit 61deb7d0dddd941d1e3ffee0d799396ac93b0e90 (HEAD, origin/drm-next-4.17-wip)
Comment 22 dwagner 2018-04-24 21:35:07 UTC
Two patches posted by Harry Wentland in bug report
https://bugs.freedesktop.org/show_bug.cgi?id=106159
today may hold the key to finally getting rid of this long-standing bug:

When I apply 
https://bugs.freedesktop.org/attachment.cgi?id=139069
and
https://bugs.freedesktop.org/attachment.cgi?id=139070
on top of
"commit ecdd681a62f592e03c3783709526278bdc7ad5cc (HEAD, origin/drm-next-4.18-wip)"
_then_ I can S3-resume without crashing with amdgpu.dc=1

_But_ the system will still crash on S3-resume if I use the
"drm.edid_firmware=edid/LG_EG9609_edid.bin"
kernel command line option (which I would like to use to make X11 mode detection independent of whether the connected TV is off when booting/resuming)
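
For context, a file such as `edid/LG_EG9609_edid.bin` is looked up relative to the firmware directory. One way such a file might be produced is to copy the EDID blob of a connected display from sysfs, sketched below. Everything specific here is an assumption: `save_edid` is a hypothetical helper, and the connector name `card0-HDMI-A-1` and the `/lib/firmware` location depend on the system. The helper only checks the fixed 8-byte EDID header before copying.

```shell
#!/bin/sh
# Hypothetical helper: copy an EDID blob to a firmware path after checking
# the fixed 8-byte EDID header (00 ff ff ff ff ff ff 00).
save_edid() {
    src=$1
    dst=$2
    header=$(head -c 8 "$src" | od -An -tx1 | tr -d ' \n')
    if [ "$header" != "00ffffffffffff00" ]; then
        echo "no valid EDID header in $src" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dst")" && cp "$src" "$dst"
}

# Example call (assumed connector name; needs root and a powered-on display):
# save_edid /sys/class/drm/card0-HDMI-A-1/edid /lib/firmware/edid/LG_EG9609_edid.bin
```

The header check guards against the case where the display is off and the sysfs file is empty, which is the situation the `drm.edid_firmware` option is meant to work around.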
Comment 23 dwagner 2018-04-24 21:39:21 UTC
(I should mention that a very long time ago, I also posted a bug report specifically regarding the EDID-loading feature: https://bugs.freedesktop.org/show_bug.cgi?id=102202)
Comment 24 Harry Wentland 2018-04-25 15:04:47 UTC
Thanks for testing Jerry's patches. We tried reproducing this issue many times but never thought to try MST.

Those two patches are going into 4.17 and 4.18.
Comment 25 Harry Wentland 2018-06-27 15:00:29 UTC
Is this fixed on recent kernels? If so, can we close this one?
Comment 26 dwagner 2018-06-28 19:51:08 UTC
(In reply to Harry Wentland from comment #25)
> Is this fixed on recent kernels? If so, can we close this one?

The fix seems to be included in 4.17.2.

Remaining issue mentioned above: System will still crash on S3-resume if I use the "drm.edid_firmware=edid/LG_EG9609_edid.bin" kernel command line option. (Not as severe as the "crashes under all conditions" issue originally reported.)
Should this be covered by a separate bug report?

Just for reference, a different S3 resume issue on which I just posted a different bug report: https://bugs.freedesktop.org/show_bug.cgi?id=107065
Comment 27 Michel Dänzer 2018-06-29 07:32:11 UTC
(In reply to dwagner from comment #26)
> Should this be covered by a separate bug report?

Yes, please.

Resolving this report, thanks for the follow-up.

