> This started somewhere between Kernel 4.2 and 4.3-rc1,
> but I only noticed it a day ago.
> The first S3 suspend after a fresh boot works fine.
> Thereafter, suspends simply resume again immediately.
> I get the following errors on my console:
> [ 152.697247] i915 0000:00:02.0: GEM idle failed, resume might fail
> [ 152.697258] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
> [ 152.697262] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -11
> [ 152.697264] PM: Device 0000:00:02.0 failed to suspend async: error -11
> [ 152.697306] PM: Some devices failed to suspend, or early wake event detected
> The issue is not limited to my normal way of doing suspend, using "pm-suspend".
> It also happens using the "echo mem > /sys/power/state" method.
> The kernel was bisected, and the result was double checked by clean compiles
> of the first bad commit and the immediately preceding commit. Bisect results
> copied below:
> $ git bisect good
> dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2 is the first bad commit
> commit dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2
> Author: John Harrison <John.C.Harrison at Intel.com>
> Date: Fri May 29 17:43:39 2015 +0100
> drm/i915: Add explicit request management to i915_gem_init_hw()
> Now that a single per ring loop is being done for all the different
> intialisation steps in i915_gem_init_hw(), it is possible to add proper request
> management as well. The last remaining issue is that the context enable call
> eventually ends up within *_render_state_init() and this does its own private
> _i915_add_request() call.
> This patch adds explicit request creation and submission to the top level loop
> and removes the add_request() from deep within the sub-functions.
> v2: Updated for removal of batch_obj from add_request call in previous patch.
> For: VIZ-5115
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> Reviewed-by: Tomas Elf <tomas.elf at intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
> :040000 040000 789c630ff3f5f07238a5df1bde79187c6c1251d0 2da3f7e20e2642d8eebd9f72528923c2ac53a8cb M drivers
This bug was created for tracking purposes. The issue was originally reported to the intel-gfx list; refer to http://lists.freedesktop.org/archives/intel-gfx/2015-October/077592.html
As discussed, please follow up on the mailing list with a link to each regression tracking bug you create, so that the links go both ways. Thanks.
Please use links that contain the Message-ID so that it's easier to find the messages in email. Please reference the original report.
Like this: http://mid.gmane.org/002301d1025d$d5765090$8062f1b0$@net
After the first resume from suspend, the processor is in a bizarre state, where it will not go below 2.4 GHz, even though every CPU is asking for a pstate of 16 (the minimum for my processor). This has been tested several times, on both the preceding (good) and first bad kernels using both methods of suspend.
My processor: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
Example (no load):
pstate being asked for:
# rdmsr --bitfield 15:8 -d -a 0x199
pstate that I am getting:
# rdmsr --bitfield 15:8 -d -a 0x198
# grep MHz /proc/cpuinfo
cpu MHz : 2400.054
cpu MHz : 2399.921
cpu MHz : 2399.921
cpu MHz : 2399.921
cpu MHz : 2399.789
cpu MHz : 2399.789
cpu MHz : 2399.921
cpu MHz : 2399.921
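For readers unfamiliar with the rdmsr invocations above, here is a minimal sketch of what `--bitfield 15:8` extracts and how the ratio relates to the MHz figures. It assumes a 100 MHz base clock for this processor generation; the raw register values are hypothetical, chosen only to match the numbers in this comment (16 requested, about 2400 MHz delivered), and the helper names are illustrative.

```python
# Sketch: decode the P-state ratio that `rdmsr --bitfield 15:8` extracts.
BUS_CLOCK_MHZ = 100  # assumption: 100 MHz base clock on this processor


def bitfield(value, high, low):
    """Return bits high..low (inclusive) of value, like rdmsr --bitfield."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def ratio_to_mhz(ratio):
    """Convert a P-state ratio to a core frequency in MHz."""
    return ratio * BUS_CLOCK_MHZ


# Hypothetical raw register values, not taken from the report.
requested_raw = 0x1000  # MSR 0x199: ratio being asked for
actual_raw = 0x1800     # MSR 0x198: ratio actually delivered

requested = bitfield(requested_raw, 15, 8)
actual = bitfield(actual_raw, 15, 8)
print("requested:", requested, "->", ratio_to_mhz(requested), "MHz")
print("actual:   ", actual, "->", ratio_to_mhz(actual), "MHz")
```

Under that assumption, a requested ratio of 16 corresponds to 1600 MHz, while the roughly 2400 MHz shown in /proc/cpuinfo corresponds to a ratio of 24.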
Did you double check the bisect by running both dc4be6071a24 and dc4be6071a24^ ?
(In reply to Jani Nikula from comment #6)
> Did you double check the bisect by running both dc4be6071a24 and
> dc4be6071a24^ ?
Yes, of course, and I said so in my initial e-mail.
Truth be known, this was my second bisection, as I must have made a mistake in my first attempt, because the double check failed.
(In reply to Doug Smythies from comment #7)
> (In reply to Jani Nikula from comment #6)
> > Did you double check the bisect by running both dc4be6071a24 and
> > dc4be6071a24^ ?
> Yes, of course, and I said so in my initial e-mail.
> Truth be known, this was my second bisection, as I must have made a mistake
> in my first attempt, because the double check failed.
I asked, because I suspected the bisect result might be wrong. And the symptoms in comment #5 seem odd.
Please try two things: First, run dc4be6071a24 and try suspend/resume several times, and see if it's 100% reproducible or not. Second, attach dmesg with drm.debug=14 module parameter set (for the failing case).
Created attachment 118873 [details]
requested dmesg with drm.debug=14
Possibly relevant excerpt:
[ 399.518389] [drm] stuck on render ring
[ 399.518686] [drm] GPU HANG: ecode 6:0:0xfeffffff, reason: Ring hung, action: reset
[ 399.518686] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 399.518686] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 399.518687] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 399.518687] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 399.518687] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 399.518699] [drm:i915_reset_and_wakeup] resetting chip
[ 399.518724] i915 0000:00:02.0: GEM idle failed, resume might fail
[ 399.518737] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
[ 399.518739] dpm_run_callback(): pci_pm_suspend+0x0/0x160 returns -11
[ 399.518741] PM: Device 0000:00:02.0 failed to suspend async: error -11
[ 399.518804] PM: Some devices failed to suspend, or early wake event detected
an edited /sys/class/drm/card0/error will be attached in a moment.
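In case it helps anyone comparing logs, here is a rough sketch that pulls the failing device and error code out of dmesg text like the excerpt above. The regular expression and function name are my own and only cover the "PM: Device ... failed to suspend" line format seen in this report. On Linux, error 11 is EAGAIN ("Resource temporarily unavailable").

```python
import errno
import re

# Matches lines such as:
#   PM: Device 0000:00:02.0 failed to suspend async: error -11
PM_FAIL = re.compile(
    r"PM: Device (?P<device>\S+) failed to suspend(?: async)?: "
    r"error (?P<code>-?\d+)"
)


def find_suspend_failures(dmesg_text):
    """Return a list of (device, error_code, errno_name) tuples."""
    failures = []
    for line in dmesg_text.splitlines():
        match = PM_FAIL.search(line)
        if match:
            code = int(match.group("code"))
            # The kernel reports negative errno values; look up the name.
            name = errno.errorcode.get(-code, "unknown")
            failures.append((match.group("device"), code, name))
    return failures


sample = """\
[ 399.518724] i915 0000:00:02.0: GEM idle failed, resume might fail
[ 399.518741] PM: Device 0000:00:02.0 failed to suspend async: error -11
[ 399.518804] PM: Some devices failed to suspend, or early wake event detected
"""

for device, code, name in find_suspend_failures(sample):
    print(device, code, name)
```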
Created attachment 118874 [details]
an edited version of the file the previous attachment asked for
I edited just to remove many lines of 0's.
(In reply to Jani Nikula from comment #8)
> First, run dc4be6071a24 and try suspend/resume
> several times, and see if it's 100% reproducible or not.
Yes, it happens every time. To say 100%, I would have to have a sample space of about 1000 attempts. I did not do that many.
After a fresh boot with the bad kernel, turbostat shows: Pkg%pc6 = 97.84%; PkgWatt 4.01; CorWatt 0.28; GFXWatt 0.23.
Then after the first suspend resume, turbostat shows: Pkg%pc6 = 0.00%; PkgWatt 10.08; CorWatt 3.04; GFXWatt 3.51.
PkgTmp goes up by more than 10 degrees.
The system is idle in both cases: 0.03% busy.
I'm having the same problem with 4.3; I skipped a few versions before that. However, I seem to have found an earlier bug that could be related, or at least sounds similar: https://bugs.freedesktop.org/show_bug.cgi?id=90253
This issue persists through kernel 4.4-rc8.
Bug is still valid in 4.4 release.
Yes, the bug persists through kernel 4.4.
Having isolated the issue down to the exact causal commit, I do not know what else I can do to move this one along.
Just an update: bug still exists in 4.5 rc5.
Bug still exists in 4.6 rc1. Is there anything I can provide to help with this issue?
This issue no longer occurs on my computer.
A fresh install of Linux was done, and the system now uses systemd, whereas previously it did not. I am not certain that systemd is the difference; it is just my best guess.
(In reply to tigrangab from comment #18)
> Bug still exists in 4.6 rc1. Is there anything I can provide to help with
> this issue?
Did you bisect this to the same commit reported in comment #0?
Highest+Blocker as being regression w/o workaround
Tested on drm-tip on IVB-3770, but the issue didn't appear: all suspends and resumes are fine.
firstname.lastname@example.org, can you check whether the failure still persists with the latest drm-tip? For others it seems to be resolved; see comment 19 and comment 22.
While in comment 19 above, I mentioned that this issue no longer occurred on my computer, I did try to go back and re-install an older version of my distribution (Ubuntu) on another partition in an attempt to re-create the issue. I was unsuccessful. I tried without success again today.
My original work and kernel bisection was good and repeatable. I do not understand why I can not re-create the failure scenario now. I can only assume it is because I did not install from the exact same iso starting point. The only hardware change was the hard disk that had failed.
In comment 13 there is a reference to a bug with similar symptoms (https://bugs.freedesktop.org/show_bug.cgi?id=90253). However it can not be the same root issue, because, if I understand the dates correctly, the commit that this was isolated to did not exist when that bug report was entered.
I have to update my comment - probably I didn't check the correct kernel, but the issue mysteriously appeared between 4.9.0 (69973b830859bc6529a7a0468ba0d80ee5117826) and 4.10.0-rc6 from drm-tip: 2017y-02m-01d-11h-09m-17s UTC (eb9b7b42023edc1b5849d1ff3bef490b492067a3).
The system seems to wake up immediately, and there's nothing special in dmesg, even though kernel command line includes drm.debug=0x1f
$ cat /sys/power/state
freeze mem disk
$ echo 'mem' > /sys/power/state
bash: echo: write error: Resource temporarily unavailable
$ dmesg | tail
[ 174.113592] systemd-journald: Failed to set ACL on /var/log/journal/fe605962ccdd4f5dafb1348d1329bf81/user-1000.journal, ignoring: Operation not supported
[ 209.294939] PM: Syncing filesystems ... done.
[ 209.346235] PM: Preparing system for sleep (mem)
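For scripted retries, the same write can be attempted from Python, which reports the errno name instead of the bash message. This is only a sketch: the function name is made up, it needs root to touch the real file, and the path is a parameter so it can be exercised against an ordinary file first.

```python
import errno


def request_suspend(state="mem", path="/sys/power/state"):
    """Write `state` to `path`.

    Returns None on success, or the errno name (e.g. "EAGAIN") if the
    write is rejected. With the default path this really suspends the
    machine, so call it deliberately.
    """
    try:
        # The write is flushed when the file is closed, so any error
        # from the kernel is raised inside this try block.
        with open(path, "w") as handle:
            handle.write(state)
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

A result of "EAGAIN" would correspond to the "Resource temporarily unavailable" message shown above.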
i7-3770 CPU @ 3.40GHz
Intel HD 4000
Kernel config used: https://intel-gfx-ci.01.org/CI/CI_DRM_2133/kernel.config.bz2
I managed to get more info about the issue I'm seeing. The symptoms have been mostly consistent with my previous post.
The failure to sleep does not happen every time; I had to reboot up to 5 times for the first failure to happen. Because of that, I'm not 100% certain if "good" commits are really bug-free - I tested at most 5 reboots. "Bad" commits are definitely correct though.
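To put a number on that uncertainty, here is a small worked calculation. It assumes each boot fails independently with the same probability, which is a simplification, and the failure rates are hypothetical since the real rate was not measured.

```python
def chance_bad_passes(failure_rate, boots):
    """Probability that a bad commit shows no failure in `boots` boots.

    Assumes each boot fails independently with probability `failure_rate`.
    """
    return (1.0 - failure_rate) ** boots


# Hypothetical per-boot failure rates, for illustration only.
for rate in (0.2, 0.5, 0.8):
    print(rate, chance_bad_passes(rate, 5))
```

With a 20% per-boot failure rate, for example, about one bad commit in three would survive 5 reboots and be marked "good", which is enough to send a bisect down the wrong path.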
Suspend failure is somewhat correlated to failures in dmesg, like:
[ 10.287387] BUG: unable to handle kernel paging request at ffffffffa041d82
This commit came out as bad:
commit 03430fa10b99e95e3a15eb7c00978fb1652f3b24
Merge: a2cd64f 2cfe8f8
Author: David S. Miller <email@example.com>
Date: Sun Jan 8 22:01:22 2017 -0500
Merge branch 'bcm_sf2-fixes'
Created attachment 129436 [details]
git bisect result
good commits are those which survived 5 warm reboots without failing to suspend
Created attachment 129437 [details]
first bad commit dmesg
(In reply to Dorota Czaplejewicz from comment #26)
> This commit came out as bad:
> commit 03430fa10b99e95e3a15eb7c00978fb1652f3b24
> Merge: a2cd64f 2cfe8f8
> Author: David S. Miller <firstname.lastname@example.org>
> Date: Sun Jan 8 22:01:22 2017 -0500
> Merge branch 'bcm_sf2-fixes'
And that one has nothing to do with Intel graphics...
Based on comment 24, comment 26 and comment 29 I would propose this to be closed. Should we pass this bug to other product+component or even another bugzilla?
Agree, it should be closed.
Thanks Doug Smythies for your confirmation