> This started somewhere between Kernel 4.2 and 4.3-rc1,
> but I only noticed it a day ago.
>
> The first S3 suspend after a fresh boot works fine.
> Thereafter, suspends simply resume again immediately.
>
> I get the following errors on my console:
>
> [ 152.697247] i915 0000:00:02.0: GEM idle failed, resume might fail
> [ 152.697258] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
> [ 152.697262] dpm_run_callback(): pci_pm_suspend+0x0/0x140 returns -11
> [ 152.697264] PM: Device 0000:00:02.0 failed to suspend async: error -11
> [ 152.697306] PM: Some devices failed to suspend, or early wake event detected
>
> The issue is not limited to my normal way of doing suspend, using "pm-suspend".
> It also happens using the "echo mem > /sys/power/state" method.
>
> The kernel was bisected, and the result was double checked by clean compiles
> of the first bad commit and the immediately preceding commit. Bisect results
> copied below:
>
> $ git bisect good
> dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2 is the first bad commit
> commit dc4be6071a24f0d2da6af8ce16c19f276ac4d7a2
> Author: John Harrison <John.C.Harrison at Intel.com>
> Date: Fri May 29 17:43:39 2015 +0100
>
> drm/i915: Add explicit request management to i915_gem_init_hw()
>
> Now that a single per ring loop is being done for all the different
> intialisation steps in i915_gem_init_hw(), it is possible to add proper request
> management as well. The last remaining issue is that the context enable call
> eventually ends up within *_render_state_init() and this does its own private
> _i915_add_request() call.
>
> This patch adds explicit request creation and submission to the top level loop
> and removes the add_request() from deep within the sub-functions.
>
> v2: Updated for removal of batch_obj from add_request call in previous patch.
>
> For: VIZ-5115
> Signed-off-by: John Harrison <John.C.Harrison at Intel.com>
> Reviewed-by: Tomas Elf <tomas.elf at intel.com>
> Signed-off-by: Daniel Vetter <daniel.vetter at ffwll.ch>
>
> :040000 040000 789c630ff3f5f07238a5df1bde79187c6c1251d0 2da3f7e20e2642d8eebd9f72528923c2ac53a8cb M drivers
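For readers decoding the log above: the "-11" returned by i915_pm_suspend is a negated errno value. The following is a minimal sketch, assuming a Linux host (where errno 11 is EAGAIN, "Resource temporarily unavailable"); the helper name describe_kernel_error is hypothetical and not part of any kernel or Python API.

```python
import errno
import os


def describe_kernel_error(code: int) -> str:
    """Map a negative kernel return code (e.g. -11) to its errno name and message.

    Assumption: the code is a negated errno, as in the PM log lines above.
    """
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return f"{name}: {os.strerror(number)}"


if __name__ == "__main__":
    # The value reported by pci_pm_suspend() in the log above.
    print(describe_kernel_error(-11))
```

The result depends on the platform's errno table, so the mapping should be checked on the machine that produced the log.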
This bug was created for tracking purposes. The issue was reported to the intel-gfx list; refer to http://lists.freedesktop.org/archives/intel-gfx/2015-October/077592.html
As discussed please follow up on the m-l with a link to each regression tracking bug you create so that the links go both ways. Thanks.
Please use links that contain the Message-ID so that it's easier to find the messages in email. Please reference the original report. Like this: http://mid.gmane.org/002301d1025d$d5765090$8062f1b0$@net
John, ideas?
Additional information: After the first resume from suspend, the processor is in a bizarre state, where it will not go below 2.4 GHz, even though every CPU is asking for a pstate of 16 (the minimum for my processor). This has been tested several times, on both the preceding (good) and first bad kernels using both methods of suspend.

My processor: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz

Example (no load):

pstate being asked for:
# rdmsr --bitfield 15:8 -d -a 0x199
16
16
16
16
16
16
16
16

pstate that I am getting:
# rdmsr --bitfield 15:8 -d -a 0x198
24
24
24
24
24
24
24
24

CPU freqs:
# grep MHz /proc/cpuinfo
cpu MHz : 2400.054
cpu MHz : 2399.921
cpu MHz : 2399.921
cpu MHz : 2399.921
cpu MHz : 2399.789
cpu MHz : 2399.789
cpu MHz : 2399.921
cpu MHz : 2399.921
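The relationship between the pstate numbers and the reported frequencies can be sketched as follows. This is an illustration only, assuming a 100 MHz bus clock for this processor (an assumption, though it is consistent with pstate 24 showing up as roughly 2400 MHz above). The raw register values in the usage lines are hypothetical; only bits 15:8 matter here.

```python
BUS_CLOCK_MHZ = 100  # assumption: 100 MHz base clock on this processor


def bitfield(value: int, high: int, low: int) -> int:
    """Return bits high..low (inclusive) of value, like `rdmsr --bitfield high:low`."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def pstate_to_mhz(pstate: int, bus_clock_mhz: int = BUS_CLOCK_MHZ) -> int:
    """Convert a pstate (clock ratio) to a nominal frequency in MHz."""
    return pstate * bus_clock_mhz


if __name__ == "__main__":
    requested = bitfield(0x1000, 15, 8)  # hypothetical raw value of MSR 0x199
    granted = bitfield(0x1800, 15, 8)    # hypothetical raw value of MSR 0x198
    print("requested:", requested, pstate_to_mhz(requested), "MHz")
    print("granted:  ", granted, pstate_to_mhz(granted), "MHz")
```

Under that assumption, the requested pstate of 16 corresponds to 1600 MHz, while the granted pstate of 24 corresponds to the 2.4 GHz floor described above.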
Did you double check the bisect by running both dc4be6071a24 and dc4be6071a24^ ?
(In reply to Jani Nikula from comment #6)
> Did you double check the bisect by running both dc4be6071a24 and
> dc4be6071a24^ ?

Yes, of course, and I said so in my initial e-mail.
Truth be known, this was my second bisection, as I must have made a mistake in my first attempt, because the double check failed.
(In reply to Doug Smythies from comment #7)
> (In reply to Jani Nikula from comment #6)
> > Did you double check the bisect by running both dc4be6071a24 and
> > dc4be6071a24^ ?
>
> Yes, of course, and I said so in my initial e-mail.
> Truth be known, this was my second bisection, as I must have made a mistake
> in my first attempt, because the double check failed.

I asked, because I suspected the bisect result might be wrong. And the symptoms in comment #5 seem odd.

Please try two things:

First, run dc4be6071a24 and try suspend/resume several times, and see if it's 100% reproducible or not.

Second, attach dmesg with drm.debug=14 module parameter set (for the failing case).
Created attachment 118873
requested dmesg with drm.debug=14

Possibly relevant excerpt:

[ 399.518389] [drm] stuck on render ring
[ 399.518686] [drm] GPU HANG: ecode 6:0:0xfeffffff, reason: Ring hung, action: reset
[ 399.518686] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 399.518686] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 399.518687] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 399.518687] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 399.518687] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 399.518699] [drm:i915_reset_and_wakeup] resetting chip
[ 399.518724] i915 0000:00:02.0: GEM idle failed, resume might fail
[ 399.518737] pci_pm_suspend(): i915_pm_suspend+0x0/0x50 [i915] returns -11
[ 399.518739] dpm_run_callback(): pci_pm_suspend+0x0/0x160 returns -11
[ 399.518741] PM: Device 0000:00:02.0 failed to suspend async: error -11
[ 399.518804] PM: Some devices failed to suspend, or early wake event detected

An edited /sys/class/drm/card0/error will be attached in a moment.
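When comparing several suspend attempts, it can help to pull the failing device and error code out of each dmesg capture automatically. The following is a minimal sketch, assuming the log lines keep the "PM: Device ... failed to suspend" format quoted in this report; the function name find_suspend_failures is hypothetical.

```python
import re

# Pattern modelled on the "PM: Device ... failed to suspend" lines quoted in this report.
PM_FAILURE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+PM: Device (?P<device>\S+) failed to suspend"
    r"(?: async)?: error (?P<error>-?\d+)"
)


def find_suspend_failures(dmesg_text: str) -> list[tuple[float, str, int]]:
    """Return (timestamp, PCI device, error code) for each suspend failure found."""
    return [
        (float(m["time"]), m["device"], int(m["error"]))
        for m in PM_FAILURE.finditer(dmesg_text)
    ]


if __name__ == "__main__":
    sample = (
        "[ 399.518739] dpm_run_callback(): pci_pm_suspend+0x0/0x160 returns -11\n"
        "[ 399.518741] PM: Device 0000:00:02.0 failed to suspend async: error -11\n"
    )
    for failure in find_suspend_failures(sample):
        print(failure)
```

Other kernel versions may word these messages differently, so the pattern would need adjusting against the actual log.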
Created attachment 118874
an edited version of the file the previous attachment asked for

I edited just to remove many lines of 0's.
(In reply to Jani Nikula from comment #8)
> First, run dc4be6071a24 and try suspend/resume
> several times, and see if it's 100% reproducible or not.

Yes, it happens every time.
To say 100%, I would have to have a sample space of about 1000 attempts. I did not do that many.
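The sample-size point can be made concrete. The following is a minimal sketch, assuming each suspend attempt is an independent trial with the same failure probability (an assumption, not something measured in this report). It computes a one-sided lower confidence bound on the failure rate after observing only failures; the function name is hypothetical.

```python
def failure_rate_lower_bound(consecutive_failures: int, confidence: float = 0.95) -> float:
    """Lower bound on failure probability p after n failures in n attempts.

    If the true rate were p, n failures in a row would occur with probability
    p**n. The bound is the smallest p for which p**n >= 1 - confidence.
    """
    if consecutive_failures < 1:
        raise ValueError("need at least one observed attempt")
    return (1.0 - confidence) ** (1.0 / consecutive_failures)


if __name__ == "__main__":
    for attempts in (5, 10, 100, 1000):
        print(attempts, round(failure_rate_lower_bound(attempts), 4))
```

Under these assumptions, ten failures out of ten attempts only support a failure rate above roughly 0.74 at 95% confidence, while about 1000 attempts are needed to push the bound to about 0.997.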
Additional information: After a fresh boot with the bad kernel, turbostat shows: Pkg%pc6 = 97.84%; PkgWatt 4.01; CorWatt 0.28; GFXWatt 0.23. Then after the first suspend resume, turbostat shows: Pkg%pc6 = 0.00%; PkgWatt 10.08; CorWatt 3.04; GFXWatt 3.51. PkgTmp goes up by more than 10 degrees. The system is idle in both cases: 0.03% busy.
I'm having the same problem with 4.3; I skipped a few versions before that. However, I seem to have found an earlier bug that could be related; at least it sounds similar: https://bugs.freedesktop.org/show_bug.cgi?id=90253
This issue persists through kernel 4.4-rc8.
Bug is still valid in 4.4 release.
Yes, the bug persists through kernel 4.4. Having isolated the issue down to the exact causal commit, I do not know what else I can do to move this one along.
Just an update: bug still exists in 4.5 rc5.
Bug still exists in 4.6 rc1. Is there anything I can provide to help with this issue?
This issue no longer occurs on my computer. A fresh install of Linux was done, and now the system is using systemd whereas previously it was not. I am not certain systemd is the difference; it is just my best guess.
(In reply to tigrangab from comment #18)
> Bug still exists in 4.6 rc1. Is there anything I can provide to help with
> this issue?

Did you bisect this to the same commit reported in comment #0?
Highest+Blocker as being regression w/o workaround
Tested on drm-tip on IVB-3770, but the issue didn't appear: all suspends and resumes are fine.
tigrangab@gmail.com, can you check whether the failure still persists with the latest drm-tip? For others it seems to be resolved, see comment 19 and comment 22.
While in comment 19 above, I mentioned that this issue no longer occurred on my computer, I did try to go back and re-install an older version of my distribution (Ubuntu) on another partition in an attempt to re-create the issue. I was unsuccessful. I tried without success again today. My original work and kernel bisection was good and repeatable. I do not understand why I can not re-create the failure scenario now. I can only assume it is because I did not install from the exact same iso starting point. The only hardware change was the hard disk that had failed. In comment 13 there is a reference to a bug with similar symptoms (https://bugs.freedesktop.org/show_bug.cgi?id=90253). However it can not be the same root issue, because, if I understand the dates correctly, the commit that this was isolated to did not exist when that bug report was entered.
I have to update my comment - probably I didn't check the correct kernel, but the issue mysteriously appeared between 4.9.0 (69973b830859bc6529a7a0468ba0d80ee5117826) and 4.10.0-rc6 from drm-tip: 2017y-02m-01d-11h-09m-17s UTC (eb9b7b42023edc1b5849d1ff3bef490b492067a3).

The system seems to wake up immediately, and there's nothing special in dmesg, even though kernel command line includes drm.debug=0x1f

$ cat /sys/power/state
freeze mem disk
$ echo 'mem' > /sys/power/state
bash: echo: write error: Resource temporarily unavailable
$ dmesg | tail
[ 174.113592] systemd-journald[521]: Failed to set ACL on /var/log/journal/fe605962ccdd4f5dafb1348d1329bf81/user-1000.journal, ignoring: Operation not supported
[ 209.294939] PM: Syncing filesystems ... done.
[ 209.346235] PM: Preparing system for sleep (mem)

System:
Fedora 24
i7-3770 CPU @ 3.40GHz
Intel HD 4000

Kernel config used: https://intel-gfx-ci.01.org/CI/CI_DRM_2133/kernel.config.bz2
I managed to get more info about the issue I'm seeing. The symptoms have been mostly consistent with my previous post.

The failure to sleep does not happen every time; I had to reboot up to 5 times for the first failure to happen. Because of that, I'm not 100% certain if "good" commits are really bug-free - I tested at most 5 reboots. "Bad" commits are definitely correct though.

Suspend failure is somewhat correlated to failures in dmesg, like:

[ 10.287387] BUG: unable to handle kernel paging request at ffffffffa041d828

This commit came out as bad:

commit 03430fa10b99e95e3a15eb7c00978fb1652f3b24
Merge: a2cd64f 2cfe8f8
Author: David S. Miller <davem@davemloft.net>
Date: Sun Jan 8 22:01:22 2017 -0500

Merge branch 'bcm_sf2-fixes'
Created attachment 129436
git bisect result

Good commits are those which survived 5 warm reboots without failing to suspend.
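The risk in calling a commit "good" after 5 clean reboots can be estimated with a short calculation. This is a sketch under stated assumptions: each reboot is treated as an independent trial, and the per-boot failure rate of 0.2 used in the usage lines is purely hypothetical, since the thread does not measure it. Both function names are hypothetical as well.

```python
import math


def chance_bad_commit_looks_good(per_boot_failure_rate: float, reboots: int = 5) -> float:
    """Probability that a genuinely bad commit survives every reboot without failing."""
    return (1.0 - per_boot_failure_rate) ** reboots


def reboots_needed(per_boot_failure_rate: float, max_miss_probability: float = 0.05) -> int:
    """Smallest number of clean reboots that keeps the miss probability under the limit."""
    return math.ceil(
        math.log(max_miss_probability) / math.log(1.0 - per_boot_failure_rate)
    )


if __name__ == "__main__":
    rate = 0.2  # hypothetical per-boot failure rate, not taken from the report
    print(chance_bad_commit_looks_good(rate, reboots=5))
    print(reboots_needed(rate))
```

With an intermittent failure, a small number of clean reboots leaves a meaningful chance of mislabelling a bad commit as good, which can steer a bisection to an unrelated commit.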
Created attachment 129437
first bad commit dmesg
(In reply to Dorota Czaplejewicz from comment #26)
> This commit came out as bad:
> commit 03430fa10b99e95e3a15eb7c00978fb1652f3b24
> Merge: a2cd64f 2cfe8f8
> Author: David S. Miller <davem@davemloft.net>
> Date: Sun Jan 8 22:01:22 2017 -0500
>
> Merge branch 'bcm_sf2-fixes'

And that one has nothing to do with Intel graphics...
Based on comment 24, comment 26 and comment 29 I would propose this to be closed. Should we pass this bug to other product+component or even another bugzilla?
Agree, it should be closed.
Thanks Doug Smythies for your confirmation