Bug 111623 - No package C-states on Dell XPS 15 9570 after resume from deep sleep
Summary: No package C-states on Dell XPS 15 9570 after resume from deep sleep
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
: high major
Assignee: Don Hiatt
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-10 05:25 UTC by s.zharkoff
Modified: 2019-11-16 10:09 UTC (History)
2 users (show)

See Also:
i915 platform: KBL
i915 features: firmware/guc, firmware/huc, power/runtime PM, power/suspend-resume


Attachments
dmesg before resume (114.36 KB, text/plain)
2019-09-10 06:17 UTC, s.zharkoff
no flags Details
dmesg after resume (114.06 KB, text/plain)
2019-09-10 06:17 UTC, s.zharkoff
no flags Details
v5.3.7-resume-without-guc-submission.patch (593 bytes, patch)
2019-10-24 18:12 UTC, Tomas Janousek
no flags Details | Splinter Review
v5.3.7-resume-without-guc-submission-v2.patch (1.16 KB, patch)
2019-11-15 15:03 UTC, Tomas Janousek
no flags Details | Splinter Review

Description s.zharkoff 2019-09-10 05:25:28 UTC

    
Comment 1 s.zharkoff 2019-09-10 05:29:49 UTC
Movig the bug from kernel bugzilla
https://bugzilla.kernel.org/show_bug.cgi?id=204751

Kernel 5.2 series - suspend, resume - package enter c-states up to C10.
Kernel 5.3 starting from 5.3-rc1 till latest drm-tip builds - states up to C10 before suspend, no package C-states after resume. 

Zhang Rui:

before suspend
CPU%c7	CoreTmp	PkgTmp	GFX%rc6	GFXMHz	Totl%C0	Any%C0	GFX%C0
99.40	38	38	99.85	350	1.29	0.79	0.14
after resume
CPU%c7	CoreTmp	PkgTmp	GFX%rc6	GFXMHz	Totl%C0	Any%C0	GFX%C0
99.49	39	39	1.32	350	1.21	1.00	95.04

It is graphics that blocks cpu from entering package c-state.
Please file a bug at freedesktop.org and attach the full dmesg output with kernel parameter drm.debug=0xe, after the problem reproduced.
Comment 2 Lakshmi 2019-09-10 05:52:16 UTC
Is it a duplicate of Bug 111594?
Comment 3 s.zharkoff 2019-09-10 06:17:12 UTC
Created attachment 145312 [details]
dmesg before resume
Comment 4 s.zharkoff 2019-09-10 06:17:41 UTC
Created attachment 145313 [details]
dmesg after resume
Comment 5 s.zharkoff 2019-09-10 06:27:54 UTC
(In reply to Lakshmi from comment #2)
> Is it a duplicate of Bug 111594?

could be, have not tried to do dpms off - for me it is on suspend resume. (In reply to Lakshmi from comment #2)
> Is it a duplicate of Bug 111594?

For me xset dpms force off works fine, C-states are still available until suspend-resume.
Comment 6 Jon Ewins 2019-09-16 15:44:11 UTC
Can you please retry with enable_guc=0 set in the module parameters.  That will disable the use of the GuC and HuC units.  That was seen to help in the case of Bug 111594.
Comment 7 s.zharkoff 2019-09-17 05:39:41 UTC
I've tried guc=0 for 5.3 release. Yes - the c-states are there, but before suspend my laptop spending about 70% in C10. After resume C8 is the lowest state it can reach.  So seems that guc is involved in this, but it is not a root cause...
Comment 8 Don Hiatt 2019-09-20 21:52:58 UTC
(In reply to s.zharkoff from comment #7)
> I've tried guc=0 for 5.3 release. Yes - the c-states are there, but before
> suspend my laptop spending about 70% in C10. After resume C8 is the lowest
> state it can reach.  So seems that guc is involved in this, but it is not a
> root cause...

Thanks for the information, I'm trying to duplicate the issue and will update the fdo once I have some more data. Thank you!
Comment 9 Don Hiatt 2019-10-09 17:49:03 UTC
(In reply to s.zharkoff from comment #7)
> I've tried guc=0 for 5.3 release. Yes - the c-states are there, but before
> suspend my laptop spending about 70% in C10. After resume C8 is the lowest
> state it can reach.  So seems that guc is involved in this, but it is not a
> root cause...

Hello, are you still seeing this behavior on 5.3 (or with later releases)? Would it be possible for you to repeat the test with the i915 driver removed?
Comment 10 s.zharkoff 2019-10-13 05:27:00 UTC
Latest drm-tip looks fine so far.
Comment 11 s.zharkoff 2019-10-16 12:26:43 UTC
Checked again - it works only if GUC is not loaded, drm-tip is made to read guc 33.0.0 which i did not have so it looked for me that it is OK. But when I've put guc 33.0.0 in place still no c-states under resume.

It is not specific to Dell - tested same for LG Gram 13z980, same story
Comment 12 Don Hiatt 2019-10-16 16:51:24 UTC
(In reply to s.zharkoff from comment #11)
> Checked again - it works only if GUC is not loaded, drm-tip is made to read
> guc 33.0.0 which i did not have so it looked for me that it is OK. But when
> I've put guc 33.0.0 in place still no c-states under resume.
> 
> It is not specific to Dell - tested same for LG Gram 13z980, same story

Thanks for the info wrt to the firmware version, maybe that's why I've been unable to duplicate. I'll see what I find. Thank you.
Comment 13 Don Hiatt 2019-10-16 17:28:33 UTC
I have KBL runing drmtip 5.4.0 and only enable_guc=1 loads without any error messages. Are you seeing this behavior as well? Which enable_guc option are you running with?

---------------------------
Linux dhiatt-nuc 5.4.0-rc2+ #1 SMP Tue Oct 15 16:04:06 PDT 2019 x86_64 x86_64 x86_64 GNU/Linux


---------------------------
modprobe i915
[  902.352519] i915 0000:00:02.0: Found 64MB of eDRAM
[  902.352529] i915 0000:00:02.0: vgaarb: deactivate vga console
[  902.353679] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[  902.353679] [drm] Driver supports precise vblank timestamp query.
[  902.354489] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[  902.406688] i915 0000:00:02.0: GuC initialization failed -1
[  902.406692] i915 0000:00:02.0: Enabling uc failed (-5)
[  902.406698] i915 0000:00:02.0: Failed to initialize GPU, declaring it wedged!
[  902.432230] [drm] Initialized i915 1.6.0 20191007 for 0000:00:02.0 on minor 0

---------------------------
modprobe i915 enable_guc=1
[  991.862133] Setting dangerous option enable_guc - tainting kernel
[  991.862523] i915 0000:00:02.0: Incompatible option enable_guc=1 - GuC submission is N/A
[  991.862604] i915 0000:00:02.0: Found 64MB of eDRAM
[  991.862617] i915 0000:00:02.0: vgaarb: deactivate vga console
[  991.864535] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[  991.864536] [drm] Driver supports precise vblank timestamp query.
[  991.865582] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[  991.920772] [drm] GuC communication enabled
[  991.920808] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
[  991.922133] [drm] Initialized i915 1.6.0 20191007 for 0000:00:02.0 on minor 0

---------------------------
modprobe i915 enable_guc=2
[ 1269.509954] i915 0000:00:02.0: WOPCM: no space for HuC: 16K < 236K
[ 1269.510101] i915 0000:00:02.0: Unsuccessful WOPCM partitioning
[ 1269.510107] i915 0000:00:02.0: GuC initialization failed -7
[ 1269.510109] i915 0000:00:02.0: Enabling uc failed (-5)
[ 1269.510116] i915 0000:00:02.0: Failed to initialize GPU, declaring it wedged!

---------------------------
modprobe i915 enable_guc=3

[  749.106732] Setting dangerous option enable_guc - tainting kernel
[  749.107060] i915 0000:00:02.0: Incompatible option enable_guc=3 - GuC submission is N/A
[  749.107134] i915 0000:00:02.0: Found 64MB of eDRAM
[  749.107149] i915 0000:00:02.0: vgaarb: deactivate vga console
[  749.108107] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[  749.108108] [drm] Driver supports precise vblank timestamp query.
[  749.109099] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[  749.160888] i915 0000:00:02.0: WOPCM: no space for HuC: 16K < 236K
[  749.161040] i915 0000:00:02.0: Unsuccessful WOPCM partitioning
[  749.161047] i915 0000:00:02.0: GuC initialization failed -7
[  749.161048] i915 0000:00:02.0: Enabling uc failed (-5)
[  749.161054] i915 0000:00:02.0: Failed to initialize GPU, declaring it wedged!
Comment 14 Don Hiatt 2019-10-16 17:37:19 UTC
Actually the modprobe i915 without any options seems okay after a reboot.
Comment 15 Don Hiatt 2019-10-16 17:41:58 UTC
(In reply to Don Hiatt from comment #14)
> Actually the modprobe i915 without any options seems okay after a reboot.

The 'modprobe i915' failure can be duplicated by loading the driver with enable_guc=2, unloading the driver, and then 'modprobe i915'.
Comment 16 Don Hiatt 2019-10-16 19:16:53 UTC
Okay, I can duplicate this issue when enable_guc=3, looks like there is an issue with the 'kbl_guc_33.0.0.bin' firmware. 

[dhiatt ~]$ sudo ./turbostat --quiet --show CPU%c7,GFX%rc6,Totl%C0,Any%C0,GFX%C0
CPU%c7  GFX%rc6 Totl%C0 Any%C0  GFX%C0
99.16   100.70  1.25    0.98    0.00
99.08   100.71  1.25    0.98    0.00
99.24
CPU%c7  GFX%rc6 Totl%C0 Any%C0  GFX%C0
99.58   100.74  0.59    0.50    0.00
99.50   100.74  0.59    0.50    0.00
99.66

[dhiatt ~]$ sudo rtcwake -s 30 -m disk
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "disk" using /dev/rtc0 at Wed Oct 16 19:18:21 2019

[dhiatt ~]$ sudo ./turbostat --quiet --show CPU%c7,GFX%rc6,Totl%C0,Any%C0,GFX%C0
CPU%c7  GFX%rc6 Totl%C0 Any%C0  GFX%C0
99.68   4.81    0.44    0.40    81.71
99.68   4.81    0.44    0.40    81.71
99.68
CPU%c7  GFX%rc6 Totl%C0 Any%C0  GFX%C0
99.75   4.83    0.32    0.30    81.72
99.75   4.83    0.32    0.30    81.72
99.76

[  349.858510] Setting dangerous option enable_guc - tainting kernel
[  349.858813] i915 0000:00:02.0: Incompatible option enable_guc=3 - GuC submission is N/A
[  349.858873] i915 0000:00:02.0: Found 64MB of eDRAM
[  349.858884] i915 0000:00:02.0: vgaarb: deactivate vga console
[  349.860277] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[  349.860278] [drm] Driver supports precise vblank timestamp query.
[  349.861550] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[  349.921602] [drm] GuC communication enabled
[  349.925704] i915 0000:00:02.0: GuC firmware i915/kbl_guc_33.0.0.bin version 33.0 submission:disabled
[  349.925705] i915 0000:00:02.0: HuC firmware i915/kbl_huc_4.0.0.bin version 4.0 authenticated:yes
Comment 17 Don Hiatt 2019-10-16 19:56:36 UTC
I tried 'kbl_guc_32.0.3.bin' and it behaves the same way.
Comment 18 Don Hiatt 2019-10-16 20:14:33 UTC
It looks like once the driver is loaded with enable_guc=3, reloading the driver with enable_guc=0 always fails and requires a reboot. This is what I was running into earlier today.

Also, once the driver is loaded with enable_guc=0 we never get into rc6 as the gpu is wedged. The only fix is to reboot or to load the driver with enable_guc=3.

Same behavior with 'kbl_guc_32.0.3.bin'.

Looks to be a firmware issue.

--[load driver with enable_guc=3]---
[   39.892295] [drm] GuC communication stopped <---- driver removed

--[reload driver with enable_guc=0]---

[   57.712627] Setting dangerous option enable_guc - tainting kernel
[   57.712988] i915 0000:00:02.0: Found 64MB of eDRAM
[   57.712998] i915 0000:00:02.0: vgaarb: deactivate vga console
[   57.714338] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[   57.714339] [drm] Driver supports precise vblank timestamp query.
[   57.715451] [drm] Finished loading DMC firmware i915/kbl_dmc_ver1_04.bin (v1.4)
[   57.769281] i915 0000:00:02.0: GuC initialization failed -1
[   57.769285] i915 0000:00:02.0: Enabling uc failed (-5)
[   57.769291] i915 0000:00:02.0: Failed to initialize GPU, declaring it wedged!
Comment 19 Tomas Janousek 2019-10-24 08:24:14 UTC
Reproducible on ThinkPad T25 (T470 equivalent) as well.
Comment 20 Tomas Janousek 2019-10-24 17:03:55 UTC
Turns out I can reproduce this with i915/kbl_guc_ver9_39.bin + i915/kbl_huc_ver02_00_1810.bin...

I bisected this to [cc80b2ef24898dc62242e194270200b01ca758d1] drm/i915/guc: Change platform default GuC mode -- I've been running with enable_guc=-1. If I had been running with enable_guc=3, I guess it would've bisected to [a2904ade3dc28cf1a1b7deded41f4369f75e664c] drm/i915/guc: Don't allow GuC submission.

Anyway, it seems that on this Kaby Lake system, loading HuC without GuC submission leads to the GPU not entering rc6 and thus keeping the package out of C-states. Obviously then it doesn't matter what GuC version it is, be it 9.39 or 32.0.3 or whatever.

Dunno what the solution is. Judging from the cc80b2ef24898dc62242e194270200b01ca758d1 commit message, I'm guessing it might be appropriate to revert the GuC firmware versions to 9.39 for 5.3?
Comment 21 Don Hiatt 2019-10-24 17:08:16 UTC
(In reply to Tomas Janousek from comment #20)
> Turns out I can reproduce this with i915/kbl_guc_ver9_39.bin +
> i915/kbl_huc_ver02_00_1810.bin...
> 
> Anyway, it seems that on this Kaby Lake system, loading HuC without GuC
> submission leads to the GPU not entering rc6 and thus keeping the package
> out of C-states. Obviously then it doesn't matter what GuC version it is, be
> it 9.39 or 32.0.3 or whatever.

Pretty much what I concluded, I just posted this for review, can you give it a try? Thanks. 

https://patchwork.freedesktop.org/patch/337552/
Comment 22 Tomas Janousek 2019-10-24 18:12:34 UTC
Created attachment 145808 [details] [review]
v5.3.7-resume-without-guc-submission.patch

(In reply to Don Hiatt from comment #21)
> Pretty much what I concluded, I just posted this for review, can you give it
> a try? Thanks. 
> 
> https://patchwork.freedesktop.org/patch/337552/

I backported this to 5.3.7 (attached) and yeah, it does fix the issue. I disagree with the patch description, though: the function is definitely being called with enable_guc=-1, otherwise I wouldn't need this fix.

(And it does seem a bit weird to skip resume but not skip suspend, but I have no idea what this non-free blobby whatever does, so I guess I'll leave that up to you guys.)
Comment 23 Don Hiatt 2019-10-24 18:18:48 UTC
(In reply to Tomas Janousek from comment #22)
> Created attachment 145808 [details] [review] [review]
> v5.3.7-resume-without-guc-submission.patch
> 
> (In reply to Don Hiatt from comment #21)
> > Pretty much what I concluded, I just posted this for review, can you give it
> > a try? Thanks. 
> > 
> > https://patchwork.freedesktop.org/patch/337552/
> 
> I backported this to 5.3.7 (attached) and yeah, it does fix the issue. I
> disagree with the patch description, though: the function is definitely
> being called with enable_guc=-1, otherwise I wouldn't need this fix.
> 
> (And it does seem a bit weird to skip resume but not skip suspend, but I
> have no idea what this non-free blobby whatever does, so I guess I'll leave
> that up to you guys.)

Thanks for test this, Tomas. Yeah, I'll fix up the commit message.
Comment 24 s.zharkoff 2019-10-28 16:42:44 UTC
I've applied the patch from patchwor.freedesktop.org over drm-tip tree and use it 3 days already - suspend, resume, hibernate - so far so good, all c-states are reached.
Comment 25 Don Hiatt 2019-10-28 17:00:25 UTC
(In reply to s.zharkoff from comment #24)
> I've applied the patch from patchwor.freedesktop.org over drm-tip tree and
> use it 3 days already - suspend, resume, hibernate - so far so good, all
> c-states are reached.

Good to hear! Hopefully I can get a review and this merged soon.
Comment 26 s.zharkoff 2019-10-30 11:38:10 UTC
Compiled latest drm-tip and it suspend-resume well without GUC submission patch applied.
Comment 27 Don Hiatt 2019-10-30 17:19:29 UTC
(In reply to s.zharkoff from comment #26)
> Compiled latest drm-tip and it suspend-resume well without GUC submission
> patch applied.

Is that with guc_enable=-1 ? We'll still push the changes (once the review is complete) as it's the proper thing to do anyhow. On my KBL, the latest drm-tip still requires the change.
Comment 28 Tomas Janousek 2019-11-15 15:03:53 UTC
Created attachment 145969 [details] [review]
v5.3.7-resume-without-guc-submission-v2.patch

Don, I'm wondering if you happen to have any idea what's delaying this? Are the folks at Intel too busy dealing with the vulns as of late? If that's the case, I totally understand. Otherwise it's a bit sad that fixing a regression is not a priority.

Anyway, I'm attaching a patch that I've been running for a few weeks and that I'd really love to see merged to some 5.3.x stable. Do feel free to add

  Tested-by: Tomas Janousek <tomi@nomi.cz>

to any functionally equivalent version of the patch, if it helps to get this merged. Is there any other way I can help?
Comment 29 Don Hiatt 2019-11-15 16:51:20 UTC
Hi Tomas,

I pushed a new revision to the mailing list yesterday [1]. It's functionally the same patch, only how it determines if the GuC is active is different -- just to future proof for things that are in flight. 

Bugs are important to us, I got busy with another task and didn't have time to post the revision until yesterday. Sorry about that.

Thanks for testing and I appreciate your comments.

don

[1] https://patchwork.freedesktop.org/series/69499/
Comment 30 Chris Wilson 2019-11-16 10:09:27 UTC
commit 82e0c5bbd6eb1d274b5a3e519ff0ab91f1f8e537 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Don Hiatt <don.hiatt@intel.com>
Date:   Fri Nov 15 15:15:38 2019 -0800

    drm/i915/guc: Skip suspend/resume GuC action on platforms w/o GuC submission
    
    On some platforms (e.g. KBL) that do not support GuC submission, but
    the user enabled the GuC communication (e.g for HuC authentication)
    calling the GuC EXIT_S_STATE action results in lose of ability to
    enter RC6. We can remove the GuC suspend/resume entirely as we do
    not need to save the GuC submission status.
    
    Add intel_guc_submission_is_enabled() function to determine if
    GuC submission is active.
    
    v2: Do not suspend/resume the GuC on platforms that do not support
        Guc Submission.
    v3: Fix typo, move suspend logic to remove goto.
    v4: Use intel_guc_submission_is_enabled() to check GuC submission
        status.
    v5: No need to look at engine to determine if submission is enabled.
        Squash fix + intel_guc_submission_is_enabled() patch into one.
    v6: Move resume check into intel_guc_resume() for symmetry.
        Fix commit Fixes tag.
    
    Reported-by: KiteStramuort <kitestramuort@autistici.org>
    Reported-by: S. Zharkoff <s.zharkoff@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111594
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111623
    Fixes: ffd5ce22faa4 ("drm/i915/guc: Updates for GuC 32.0.3 firmware")
    Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
    Cc: Daniele Ceralo Spurio <daniele.ceraolospurio@intel.com>
    Cc: Stuart Summers <stuart.summers@intel.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Tomas Janousek <tomi@nomi.cz>
    Signed-off-by: Don Hiatt <don.hiatt@intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191115231538.1249-1-don.hiatt@intel.com


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.