Bug 92414 - [Intel-gfx] As of kernel 4.3-rc1 system will not stay in S3 suspend [REGRESSION][BISECTED]
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: highest blocker
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords: bisected, regression
Depends on:
Blocks:
 
Reported: 2015-10-10 17:26 UTC by Jairo Miramontes
Modified: 2017-07-25 08:45 UTC

See Also:
i915 platform: ALL
i915 features: power/suspend-resume


Attachments
requested dmesg with drm.debug=14 (197.82 KB, text/plain)
2015-10-14 15:49 UTC, Doug Smythies
an edited version of the file the previous attachment asked for (15.41 KB, text/plain)
2015-10-14 15:51 UTC, Doug Smythies
git bisect result (3.56 KB, text/plain)
2017-02-09 13:54 UTC, Dorota Czaplejewicz
first bad commit dmesg (62.12 KB, text/plain)
2017-02-09 13:56 UTC, Dorota Czaplejewicz

Description Jairo Miramontes 2015-10-10 17:26:57 UTC
> This started somewhere between Kernel 4.2 and 4.3-rc1,
> but I only noticed it a day ago.
> 
> The first S3 suspend after a fresh boot works fine.
> Thereafter, suspends simply resume again immediately.
> 
> I get the following errors on my console:
> 
> [  152.697247] i915 0000:00:02.0: GEM idle failed, resume might fail
> [  152.697258] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
> [  152.697262] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -11
> [  152.697264] PM: Device 0000:00:02.0 failed to suspend async: error -11
> [  152.697306] PM: Some devices failed to suspend, or early wake event detected
> 
> The issue is not limited to my normal way of doing suspend, using "pm-suspend".
> It also happens using the "echo mem > /sys/power/state" method.
> 
> The kernel was bisected, and the result was double checked by clean compiles
> of the first bad commit and the immediately preceding commit. Bisect results
> copied below:
> 
> $ git bisect good
> dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2 is the first bad commit
> commit dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2
> Author: John Harrison <John.C.Harrison at Intel.com>
> Date:   Fri May 29 17:43:39 2015 +0100
> 
>     drm/i915: Add explicit request management to i915_gem_init_hw()
> 
>     Now that a single per ring loop is being done for all the different
>     intialisation steps in i915_gem_init_hw(), it is possible to add proper request
>     management as well. The last remaining issue is that the context enable call
>     eventually ends up within *_render_state_init() and this does its own private
>     _i915_add_request() call.
> 
>     This patch adds explicit request creation and submission to the top level loop
>     and removes the add_request() from deep within the sub-functions.
> 
>     v2: Updated for removal of batch_obj from add_request call in previous patch.
> 
>     For: VIZ-5115
>     Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
>     Reviewed-by: Tomas Elf <tomas.elf at intel.com>
>     Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
> 
> :040000 040000 789c630ff3f5f07238a5df1bde79187c6c1251d0 2da3f7e20e2642d8eebd9f72528923c2ac53a8cb M      drivers
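
The `-11` returned by `i915_pm_suspend` in the console output above is a negated errno value. The following is a minimal sketch, not part of the original report, that pulls the return code out of such a log line and looks up its symbolic name. The helper names `parse_return_code` and `decode_kernel_errno` are hypothetical, and the lookup assumes it is run on Linux, where errno 11 is expected to be EAGAIN ("Resource temporarily unavailable").

```python
import errno
import os
import re

# Log line copied from the report above.
LINE = "[  152.697258] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11"


def parse_return_code(line):
    """Extract the integer after 'returns' from a PM callback log line, or None."""
    match = re.search(r"returns (-?\d+)", line)
    return int(match.group(1)) if match else None


def decode_kernel_errno(ret):
    """Map a negative kernel return code to (symbolic name, message).

    Uses the errno table of the machine running this script, so the
    result is only meaningful on the same platform as the kernel (Linux).
    """
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


if __name__ == "__main__":
    ret = parse_return_code(LINE)
    print(ret, decode_kernel_errno(ret))
```

Read this way, the driver's suspend callback is reporting that it could not idle the GPU at that moment, rather than a permanent failure, which fits the "GEM idle failed" message preceding it.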
Comment 1 Jairo Miramontes 2015-10-10 17:30:51 UTC
This bug was created for tracking purposes. It was originally reported to the intel-gfx list; refer to http://lists.freedesktop.org/archives/intel-gfx/2015-October/077592.html
Comment 2 Daniel Vetter 2015-10-12 07:04:34 UTC
As discussed, please follow up on the mailing list with a link to each regression tracking bug you create, so that the links go both ways. Thanks.
Comment 3 Jani Nikula 2015-10-12 14:40:03 UTC
Please use links that contain the Message-ID so that it's easier to find the messages in email. Please reference the original report.

Like this: http://mid.gmane.org/002301d1025d$d5765090$8062f1b0$@net
Comment 4 Jani Nikula 2015-10-12 14:41:02 UTC
John, ideas?
Comment 5 Doug Smythies 2015-10-12 22:05:14 UTC
Additional information:

After the first resume from suspend, the processor is in a bizarre state, where it will not go below 2.4 GHz, even though every CPU is asking for a pstate of 16 (the minimum for my processor). This has been tested several times, on both the preceding (good) and first bad kernels using both methods of suspend.

My processor: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz

Example (no load):

pstate being asked for:

# rdmsr --bitfield 15:8 -d -a 0x199
16
16
16
16
16
16
16
16

pstate that I am getting:

# rdmsr --bitfield 15:8 -d -a 0x198
24
24
24
24
24
24
24
24

CPU freqs:

# grep MHz /proc/cpuinfo
cpu MHz : 2400.054
cpu MHz : 2399.921
cpu MHz : 2399.921
cpu MHz : 2399.921
cpu MHz : 2399.789
cpu MHz : 2399.789
cpu MHz : 2399.921
cpu MHz : 2399.921
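
For readers unfamiliar with the `rdmsr` invocations above: MSR 0x199 is IA32_PERF_CTL (the requested P-state) and MSR 0x198 is IA32_PERF_STATUS (the current one), and `--bitfield 15:8` selects the ratio field. The sketch below shows the same bit extraction and the conversion to a frequency. It assumes a 100 MHz bus clock, which is consistent with the `/proc/cpuinfo` readings above (ratio 24 gives about 2400 MHz); the raw register value `0x1800` and the function names are hypothetical and chosen only for illustration.

```python
def extract_bitfield(value, high, low):
    """Return bits high..low (inclusive) of value, like `rdmsr --bitfield high:low`."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def ratio_to_mhz(ratio, bclk_mhz=100):
    """Convert a P-state ratio to MHz, assuming a fixed bus clock (assumed 100 MHz)."""
    return ratio * bclk_mhz


if __name__ == "__main__":
    raw_status = 0x1800  # hypothetical raw MSR value whose bits 15:8 hold ratio 24
    current = extract_bitfield(raw_status, 15, 8)
    requested = 16  # value reported by MSR 0x199 in this comment
    print("requested MHz:", ratio_to_mhz(requested))
    print("current MHz:  ", ratio_to_mhz(current))
```

Under that assumption, the requested ratio of 16 corresponds to 1600 MHz, while the observed ratio of 24 corresponds to the 2400 MHz floor described above.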
Comment 6 Jani Nikula 2015-10-13 07:53:55 UTC
Did you double check the bisect by running both dc4be6071a24 and dc4be6071a24^ ?
Comment 7 Doug Smythies 2015-10-13 15:50:13 UTC
(In reply to Jani Nikula from comment #6)
> Did you double check the bisect by running both dc4be6071a24 and
> dc4be6071a24^ ?

Yes, of course, and I said so in my initial e-mail.
Truth be known, this was my second bisection, as I must have made a mistake in my first attempt, because the double check failed.
Comment 8 Jani Nikula 2015-10-14 07:29:11 UTC
(In reply to Doug Smythies from comment #7)
> (In reply to Jani Nikula from comment #6)
> > Did you double check the bisect by running both dc4be6071a24 and
> > dc4be6071a24^ ?
> 
> Yes, of course, and I said so in my initial e-mail.
> Truth be known, this was my second bisection, as I must have made a mistake
> in my first attempt, because the double check failed.

I asked, because I suspected the bisect result might be wrong. And the symptoms in comment #5 seem odd.

Please try two things: First, run dc4be6071a24 and try suspend/resume several times, and see if it's 100% reproducible or not. Second, attach dmesg with drm.debug=14 module parameter set (for the failing case).
Comment 9 Doug Smythies 2015-10-14 15:49:29 UTC
Created attachment 118873
requested dmesg with drm.debug=14

Possibly relevant excerpt:

[  399.518389] [drm] stuck on render ring
[  399.518686] [drm] GPU HANG: ecode 6:0:0xfeffffff, reason: Ring hung, action: reset
[  399.518686] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  399.518686] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  399.518687] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  399.518687] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  399.518687] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  399.518699] [drm:i915_reset_and_wakeup] resetting chip
[  399.518724] i915 0000:00:02.0: GEM idle failed, resume might fail
[  399.518737] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
[  399.518739] dpm_run_callback(): pci_pm_suspend+0x0/0x160 returns -11
[  399.518741] PM: Device 0000:00:02.0 failed to suspend async: error -11
[  399.518804] PM: Some devices failed to suspend, or early wake event detected

an edited /sys/class/drm/card0/error will be attached in a moment.
Comment 10 Doug Smythies 2015-10-14 15:51:10 UTC
Created attachment 118874
an edited version of the file the previous attachment asked for

I edited just to remove many lines of 0's.
Comment 11 Doug Smythies 2015-10-14 18:22:19 UTC
(In reply to Jani Nikula from comment #8)
> First, run dc4be6071a24 and try suspend/resume
> several times, and see if it's 100% reproducible or not.

Yes, it happens every time. To say 100%, I would have to have a sample space of about 1000 attempts. I did not do that many.
Comment 12 Doug Smythies 2015-10-15 23:40:55 UTC
Additional information:

After a fresh boot with the bad kernel, turbostat shows: Pkg%pc6 = 97.84%; PkgWatt 4.01; CorWatt 0.28; GFXWatt 0.23.

Then after the first suspend resume, turbostat shows: Pkg%pc6 = 0.00%; PkgWatt 10.08; CorWatt 3.04; GFXWatt 3.51.

PkgTmp goes up by more than 10 degrees.

The system is idle in both cases: 0.03% busy.
Comment 13 Johannes 2015-11-17 12:28:06 UTC
I'm having the same problem with 4.3; I skipped a few versions before that. However, I seem to have found an earlier bug that could be related, or at least sounds similar: https://bugs.freedesktop.org/show_bug.cgi?id=90253
Comment 14 Doug Smythies 2016-01-06 07:42:31 UTC
This issue persists through kernel 4.4-rc8.
Comment 15 tigrangab 2016-01-16 19:25:04 UTC
Bug is still valid in 4.4 release.
Comment 16 Doug Smythies 2016-01-17 19:51:00 UTC
Yes, the bug persists through kernel 4.4.

Having isolated the issue down to the exact causal commit, I do not know what else I can do to move this one along.
Comment 17 tigrangab 2016-02-24 18:04:19 UTC
Just an update: bug still exists in 4.5 rc5.
Comment 18 tigrangab 2016-03-28 05:10:01 UTC
Bug still exists in 4.6 rc1. Is there anything I can provide to help with this issue?
Comment 19 Doug Smythies 2016-03-29 05:48:43 UTC
This issue no longer occurs on my computer.

A fresh install of Linux was done, and now the system is using systemd, whereas previously it was not. I am not certain systemd is the difference; it is just my best guess.
Comment 20 Jani Nikula 2016-04-22 14:24:54 UTC
(In reply to tigrangab from comment #18)
> Bug still exists in 4.6 rc1. Is there anything I can provide to help with
> this issue?

Did you bisect this to the same commit reported in comment #0?
Comment 21 Jari Tahvanainen 2016-12-19 11:00:09 UTC
Setting to Highest+Blocker, as this is a regression without a workaround.
Comment 22 Dorota Czaplejewicz 2017-01-30 17:46:12 UTC
Tested on drm-tip on IVB-3770, but the issue didn't appear: all suspends and resumes are fine.
Comment 23 Jari Tahvanainen 2017-01-31 09:58:33 UTC
tigrangab@gmail.com, can you check whether the failure still persists with the latest drm-tip? For others it seems to be resolved; see comment 19 and comment 22.
Comment 24 Doug Smythies 2017-02-01 01:06:24 UTC
While in comment 19 above, I mentioned that this issue no longer occurred on my computer, I did try to go back and re-install an older version of my distribution (Ubuntu) on another partition in an attempt to re-create the issue. I was unsuccessful. I tried without success again today.

My original work and kernel bisection was good and repeatable. I do not understand why I can not re-create the failure scenario now. I can only assume it is because I did not install from the exact same iso starting point. The only hardware change was the hard disk that had failed.

In comment 13 there is a reference to a bug with similar symptoms (https://bugs.freedesktop.org/show_bug.cgi?id=90253). However it can not be the same root issue, because, if I understand the dates correctly, the commit that this was isolated to did not exist when that bug report was entered.
Comment 25 Dorota Czaplejewicz 2017-02-01 15:37:19 UTC
I have to update my comment: I probably didn't check the correct kernel earlier. The issue mysteriously appeared between 4.9.0 (69973b830859bc6529a7a0468ba0d80ee5117826) and 4.10.0-rc6 from drm-tip: 2017y-02m-01d-11h-09m-17s UTC (eb9b7b42023edc1b5849d1ff3bef490b492067a3).

The system seems to wake up immediately, and there's nothing special in dmesg, even though the kernel command line includes drm.debug=0x1f

$ cat /sys/power/state 
freeze mem disk
$ echo 'mem' > /sys/power/state
bash: echo: write error: Resource temporarily unavailable
$ dmesg | tail
[  174.113592] systemd-journald[521]: Failed to set ACL on /var/log/journal/fe605962ccdd4f5dafb1348d1329bf81/user-1000.journal, ignoring: Operation not supported
[  209.294939] PM: Syncing filesystems ... done.
[  209.346235] PM: Preparing system for sleep (mem)

System:
Fedora 24
i7-3770 CPU @ 3.40GHz
Intel HD 4000
Kernel config used: https://intel-gfx-ci.01.org/CI/CI_DRM_2133/kernel.config.bz2
Comment 26 Dorota Czaplejewicz 2017-02-09 13:52:29 UTC
I managed to get more info about the issue I'm seeing. The symptoms have been mostly consistent with my previous post.

The failure to sleep does not happen every time; I had to reboot up to 5 times for the first failure to happen. Because of that, I'm not 100% certain if "good" commits are really bug-free - I tested at most 5 reboots. "Bad" commits are definitely correct though.

Suspend failure is somewhat correlated to failures in dmesg, like:
[   10.287387] BUG: unable to handle kernel paging request at ffffffffa041d828

This commit came out as bad:
commit 03430fa10b99e95e3a15eb7c00978fb1652f3b24
Merge: a2cd64f 2cfe8f8
Author: David S. Miller <davem@davemloft.net>
Date:   Sun Jan 8 22:01:22 2017 -0500

    Merge branch 'bcm_sf2-fixes'
Comment 27 Dorota Czaplejewicz 2017-02-09 13:54:09 UTC
Created attachment 129436
git bisect result

good commits are those which survived 5 warm reboots without failing to suspend
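
A rough way to judge how much trust a "5 clean reboots means good" rule deserves when the failure is intermittent: if each reboot independently triggers the failure with probability p, a genuinely bad commit still passes n reboots with probability (1 - p)^n. The sketch below computes that risk. The per-boot failure rate of 0.2 is purely an assumed illustrative value (the report only says it took up to 5 reboots to see the first failure), independence between reboots is also an assumption, and the function names are hypothetical.

```python
import math


def false_good_probability(p_fail, trials):
    """Probability that a bad commit shows no failure in `trials` independent runs."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    return (1.0 - p_fail) ** trials


def trials_needed(p_fail, max_risk):
    """Smallest number of clean runs so that the false-good risk is <= max_risk."""
    if not 0.0 < p_fail < 1.0 or not 0.0 < max_risk < 1.0:
        raise ValueError("p_fail and max_risk must be strictly between 0 and 1")
    return math.ceil(math.log(max_risk) / math.log(1.0 - p_fail))


if __name__ == "__main__":
    p = 0.2  # assumed per-reboot failure rate, not measured in this report
    print("risk after 5 clean reboots:", false_good_probability(p, 5))
    print("reboots needed for <=5% risk:", trials_needed(p, 0.05))
```

With a per-step risk of this size, a bisection of several steps can easily take one wrong turn, which is one plausible way for it to land on an unrelated commit.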
Comment 28 Dorota Czaplejewicz 2017-02-09 13:56:07 UTC
Created attachment 129437
first bad commit dmesg
Comment 29 Jani Nikula 2017-02-09 14:58:01 UTC
(In reply to Dorota Czaplejewicz from comment #26)
> This commit came out as bad:
> commit 03430fa10b99e95e3a15eb7c00978fb1652f3b24
> Merge: a2cd64f 2cfe8f8
> Author: David S. Miller <davem@davemloft.net>
> Date:   Sun Jan 8 22:01:22 2017 -0500
> 
>     Merge branch 'bcm_sf2-fixes'

And that one has nothing to do with Intel graphics...
Comment 30 Jari Tahvanainen 2017-02-17 08:12:36 UTC
Based on comment 24, comment 26 and comment 29, I propose closing this bug. Alternatively, should we pass it to another product+component, or even another Bugzilla?
Comment 31 Doug Smythies 2017-02-17 15:14:24 UTC
Agree, it should be closed.
Comment 32 yann 2017-02-17 17:28:46 UTC
Thanks Doug Smythies for your confirmation

