Bug 54089

Summary: [SNB regression] Power consumption goes postal after resume
Product: DRI
Reporter: Christian Speckner <cnspeckn>
Component: DRM/Intel
Assignee: Rodrigo Vivi <rodrigo.vivi>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: high
CC: ali, arthur.titeica, chris, daniel, darose, florian, freedesktop.org, i.gnatenko.brain, inform, ingemar.adahl, jbarnes, jj, jvpgomes, koct9i, loic, losinski, mariusz.libera, mswal2846, nekohayo, philipp, pl4nkton, samuel, tarmo, theholyettlz, thilo, tiagomatos, tom111, tomi, uwe.sommerlatt, wolfgang.gradl, xhejtman
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments (flags in parentheses):
Fix (hack?) to reenable PM after suspend (none)
dmesg after S3 - using rc6=4 33W after S3 / 25W before S3 (none)
Dump RP interrupts limits (none)
hold rc6 while enabling rps (none)
msleep hack (none)
Rej file (none)
Fix-SNB-RC6-init-sequence (none)

Description Christian Speckner 2012-08-26 16:12:31 UTC
Created attachment 66140 [details] [review]
Fix (hack?) to reenable PM after suspend

Going from linux 3.3.1 (which worked fine) to 3.6-rc2/3, I find that power consumption goes postal after a suspend-resume cycle. Normally, it would be around 10-11 Watts, but after resume, it rises to 25 Watts. As I couldn't find any rogue processes and the CPU clock looks fine, I suspected the i915 driver. Looking at the source, I put together the attached patch to i915_suspend.c (modelled after how things were in 3.3.1), which fixes the problem for me, so this is definitely a GPU problem.

Machine: thinkpad T420 core i7 with sandybridge integrated graphics
Distribution: Gentoo linux
Comment 1 Chris Wilson 2012-08-26 16:20:45 UTC
intel_enable_gt_powersave() should already be called by intel_modeset_init_hw() during i915_drm_thaw() upon resume. Can you add some printk to see if it is indeed called?
Comment 2 Christian Speckner 2012-08-26 17:39:06 UTC
Hi Chris! It turns out I was indeed a bit quick and should have looked at the problem and the proposed solution a bit harder. After lots of rebooting and testing, I believe I have found out the following:

1) The problem is more complex than I thought. When resuming from an idle desktop without graphics activity, there is a chance that power consumption will remain sane. However, as soon as I put load on the GPU, suspending and resuming will provoke the problem, and once the GPU is in this state, there's no going back: subsequent suspend-resume cycles will not alleviate the problem.

2) The clock gating bit seems to have no effect, and leaving the mutex locked gets rid of the two nasty warnings in dmesg which I overlooked.

3) Indeed, you are correct, intel_enable_gt_powersave() is already called during i915_drm_thaw() and I could have noticed earlier as it already prints a line to dmesg without my intervention.

4) Still, the additional call to intel_enable_gt_powersave() I added seems to fix the problem, and I have not been able to reproduce the issue, although I tried extensively. Removing the call, though, immediately brings back the problem, even though intel_enable_gt_powersave() is still called from i915_drm_thaw(). Adding a printk reveals what you probably know anyway: the call I added happens before i915_drm_thaw(), so to me it looks as if the unmodified code restores the GPU to a power-hungry state which even the subsequent call to intel_enable_gt_powersave() from i915_drm_thaw() will not revert.

I hope that this is not just a placebo. I use suspend-resume on this machine excessively, so I will keep an eye during the next week and will augment the bug in case the solution turns out to be bogus.
Comment 3 Chris Wilson 2012-08-26 17:44:01 UTC
Thanks. If you can confirm that your fix remains true after a week or so of testing, that will be useful. It sounds like the order that we write the registers here is critical, so possibly some state we've overlooked.
Comment 4 Christian Speckner 2012-09-03 06:59:39 UTC
During one week of copious suspend-resume cycles and at least one daily power cycle, the issue has not resurfaced, so I am confident that my modification indeed fixes the problem.
Comment 5 Chris Wilson 2012-09-05 13:08:05 UTC
Christian, do you mind narrowing down what sets of writes are required at that time?
Comment 6 Christian Speckner 2012-09-07 07:28:54 UTC
No problem, but it might take one or two weeks until I find time to do so.
Comment 7 Rodrigo Vivi 2012-09-10 20:03:16 UTC
I confirm the bug. When setting any level of i915_enable_rc6 on the kernel command line, the power consumption increases from 26W to 33W (here on my Lenovo X1 SNB).

Unfortunately Christian's fix didn't work out for me. I'm still getting the same values with or without his patch.

However, I also noticed that when i915_enable_rc6 is not given on the command line, the issue doesn't occur. Even with i915_enable_rc6=0 I get the issue here. When the option is not given, i915_enable_rc6 is -1 and RC6 is enabled by default on SNB.

I couldn't find any difference in the BIOS, the registers, or dmesg that could explain that behavior.
Comment 8 Rodrigo Vivi 2012-09-10 20:05:08 UTC
Created attachment 66941 [details]
dmesg after S3 - using rc6=4 33W after S3 / 25W before S3
Comment 9 Christian Speckner 2012-09-11 08:23:44 UTC
I have not yet found time to cut down the number of register writes. However, I have seen the problem return since I last posted, but, unlike without the added intel_enable_gt_powersave() call, it is cured by an additional suspend-resume cycle. I tried replacing the call with a mdelay(), but that didn't help; the original problem returned immediately. I'll try to nail down the writes that fix it for me as soon as possible.
Comment 10 Chris Wilson 2012-09-13 17:24:38 UTC
(In reply to comment #8)
> Created attachment 66941 [details]
> dmesg after S3 - using rc6=4 33W after S3 / 25W before S3

Note that rc6=4 is not meant to work on SNB, only rc6=1 is known to be reliable.

What is the power consumption after boot with rc6=0?
Comment 11 Rodrigo Vivi 2012-09-17 16:32:54 UTC
On my Lenovo X1 SNB.

rc6=-1 - before s3 = 25W / after 25W
rc6=0 - before s3 = 23W / after 33W
rc6={1,2,4} - before s3 = 26W / after 33W
Comment 12 Lukas Hejtmanek 2012-10-08 15:36:36 UTC
Same for me. After resume I got 30W power, before suspend, I got 12W. powertop reports GPU active 100%. I have i915.i915_enable_rc6=1 on cmd line of kernel. (Kernel is 3.6.)
Comment 13 Chris Wilson 2012-10-09 08:22:59 UTC
Can we have a look at /sys/kernel/debug/dri/0/i915_cur_delayinfo and /sys/kernel/debug/dri/0/i915_drpc_info before/after resume?
Comment 14 Lukas Hejtmanek 2012-10-10 08:19:56 UTC
Before suspend:
cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
GT_PERF_STATUS: 0x00000d29
RPSTAT1: 0x00048d00
Render p-state ratio: 13
Render p-state VID: 41
Render p-state limit: 255
CAGF: 650MHz
RP CUR UP EI: 47196us
RP CUR UP: 4us
RP PREV UP: 0us
RP CUR DOWN EI: 0us
RP CUR DOWN: 0us
RP PREV DOWN: 0us
Lowest (RPN) frequency: 650MHz
Nominal (RP1) frequency: 650MHz
Max non-overclocked (RP0) frequency: 1300MHz

cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: RC6
Core Power Down: no
RC6 "Locked to RPn" residency since boot: 0
RC6 residency since boot: 90945341
RC6+ residency since boot: 0
RC6++ residency since boot: 0

I will provide "after suspend" values as soon as I manage to reproduce it. It seems it does not happen every time.
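
For anyone cross-checking the dump above: the decoded lines follow from the raw register values by plain bit-field extraction. Here is a minimal sketch, assuming the field layout used by the i915 debugfs code of this era (the helper names are hypothetical and not part of any tool):

```shell
#!/bin/sh
# Field layouts are assumptions taken from the i915 gen6 debugfs code of
# this era; they are only checked against the dump quoted in this comment.

# GT_PERF_STATUS: bits 15:8 = render p-state ratio, bits 7:0 = VID.
pstate_ratio() { echo $(( ($1 >> 8) & 0xff )); }
pstate_vid()   { echo $(( $1 & 0xff )); }

# RPSTAT1: bits 14:8 = CAGF (current actual GPU frequency), 50 MHz units.
cagf_mhz()     { echo $(( (($1 >> 8) & 0x7f) * 50 )); }

pstate_ratio 0x00000d29   # prints 13
pstate_vid   0x00000d29   # prints 41
cagf_mhz     0x00048d00   # prints 650
```

The same helpers can be pointed at any later dump to check whether the GPU has clocked back down.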
Comment 15 Lukas Hejtmanek 2012-10-10 16:54:29 UTC
This is after resume:
cat /sys/kernel/debug/dri/0/i915_cur_delayinfo
GT_PERF_STATUS: 0x00001ac8
RPSTAT1: 0x00049a19
Render p-state ratio: 26
Render p-state VID: 200
Render p-state limit: 255
CAGF: 1300MHz
RP CUR UP EI: 23573us
RP CUR UP: 23573us
RP PREV UP: 66000us
RP CUR DOWN EI: 100828us
RP CUR DOWN: 0us
RP PREV DOWN: 0us
Lowest (RPN) frequency: 650MHz
Nominal (RP1) frequency: 650MHz
Max non-overclocked (RP0) frequency: 1300MHz

cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: yes
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: on
Core Power Down: no
RC6 "Locked to RPn" residency since boot: 0
RC6 residency since boot: 580794
RC6+ residency since boot: 0
RC6++ residency since boot: 0

However:
while true; do grep "RC6 residency since" /sys/kernel/debug/dri/0/i915_drpc_info; sleep 10; done
RC6 residency since boot: 580794
RC6 residency since boot: 580794
RC6 residency since boot: 580794
RC6 residency since boot: 580794
RC6 residency since boot: 580794
RC6 residency since boot: 580794
RC6 residency since boot: 580794
RC6 residency since boot: 580794
RC6 residency since boot: 580794
RC6 residency since boot: 580794

I suppose it should be rising.
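
The polling loop above can be turned into a pass/fail check. This is a minimal sketch, assuming the debugfs path quoted in this thread (normally readable only by root); the function names and the DRPC_INFO variable are hypothetical:

```shell
#!/bin/sh
# Default path is the debugfs file quoted in this thread; override DRPC_INFO
# or pass a file name to test against a saved dump.
DRPC_INFO=${DRPC_INFO:-/sys/kernel/debug/dri/0/i915_drpc_info}

# Print the plain "RC6 residency since boot" counter from the given file.
# The RC6+, RC6++ and "Locked to RPn" lines do not match the pattern.
rc6_residency() {
    sed -n 's/^RC6 residency since boot: *//p' "${1:-$DRPC_INFO}"
}

# Succeed only if the counter changed between two samples taken
# $2 seconds apart (default 10, as in the loop above).
rc6_advancing() {
    before=$(rc6_residency "$1")
    sleep "${2:-10}"
    after=$(rc6_residency "$1")
    [ -n "$before" ] && [ "$after" != "$before" ]
}

# Example: rc6_advancing "$DRPC_INFO" 10 || echo "RC6 residency is stuck"
```

A counter that stays at the same value across samples, as in the output above, makes rc6_advancing fail.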
Comment 16 Chris Wilson 2012-10-10 17:08:17 UTC
Created attachment 68407 [details] [review]
Dump RP interrupts limits

Drat, we're missing some information. Can you also apply the attached patch and dump again in the future?

In the meantime, grab intel-gpu-tools, and do intel_forcewaked & intel_reg_read 0xa014
Comment 17 Lukas Hejtmanek 2012-10-10 17:18:25 UTC
intel_forcewaked & intel_reg_read 0xa014
[1] 14876
Forcewake locked
0xA014 : 0xD0000
Comment 18 Chris Wilson 2012-10-11 09:06:35 UTC
Ok, that's the issue. We've programmed the up/down clocking interrupts as if we are in the lowest power state, when evidently we are still at max frequency - and so we never get a down clock.
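
A minimal sketch of splitting the 0xA014 value from comment 17 into fields. It assumes, based on the i915 gen6 RPS code of this era rather than on public hardware documentation, that bits 31:24 and 23:16 each hold a frequency limit in 50 MHz units. The function names are hypothetical:

```shell
#!/bin/sh
# Assumed layout of the register at 0xA014 (RP interrupt limits):
#   bits 31:24 = upper limit field, bits 23:16 = lower limit field,
# both in 50 MHz units. This is an assumption, not documented behavior.
rp_limit_upper_mhz() { echo $(( (($1 >> 24) & 0xff) * 50 )); }
rp_limit_lower_mhz() { echo $(( (($1 >> 16) & 0xff) * 50 )); }

rp_limit_upper_mhz 0xD0000   # prints 0
rp_limit_lower_mhz 0xD0000   # prints 650
```

Under that assumption the lower field decodes to 650MHz, the same as the RPN frequency in the dumps, while CAGF in the after-resume dump reads 1300MHz.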
Comment 19 Chris Wilson 2012-10-11 09:12:53 UTC
This ties in with your original observation that we needed to initialise gt powersaving twice, which makes me think that the ordering of intel_modeset_init_hw is wrong or that there is a delay required after setting some of the powerwells and reading values?
Comment 20 Lukas Hejtmanek 2012-10-11 12:24:06 UTC
Is there anything more I can provide?

Strangely, it does not happen after each suspend/resume cycle. I do not know how many cycles or what exactly has to happen so that RC6 is never reached.
Comment 21 Chris Wilson 2012-10-11 12:37:37 UTC
Christian, it depends how bored you are and whether you want to play with randomly moving functions about and inserting delays. Otherwise I'd let Rodrigo see if he can find a pattern. :)
Comment 22 Christian Speckner 2012-10-11 13:06:29 UTC
Hi! Life has been pretty busy over the last couple of weeks, so I must admit that I forgot about the issue ;) I think I'll find time to play around with the sequence and delays over the weekend, though.
Comment 23 Christian Speckner 2012-10-13 15:33:07 UTC
After cutting down the code, it seems that putting

gen6_gt_force_wake_get(dev->dev_private);
mdelay (10);
gen6_gt_force_wake_put(dev->dev_private);

(instead of the full intel_enable_gt_powersave) into i915_restore_state just before "/* Cache mode state */" is enough to fix the error for me. Leaving out the mdelay makes the behavior unreliable: sometimes power consumption will be sane after resume, sometimes it won't. Just the delay is not sufficient either. Note that 10 msecs is probably way more than necessary, but I didn't have the patience to experiment with the value ;)
Comment 24 Christian Speckner 2012-10-13 16:07:37 UTC
Stupid me; the mdelay should be a msleep, and it is indeed enough. Let me explain: when experimenting I used a msleep by accident; I really wanted to use a mdelay. When I found that the code snippet which I wrote in my earlier post worked, I remembered that I had been toying with a mdelay at this place already before, without effect, so I did not retest. When writing the post, I noticed the msleep and decided to turn it into a mdelay, as I never would have thought that this makes a difference. As I didn't want to rewrite the post after a reboot, I finished it and then rebooted to do a final check. However, to my total puzzlement, the mdelay variant was broken again!

So, in a nutshell: a msleep(10) does the trick, nothing else needed. Funnily, mdelay does _not_ work, even if I increase the value to 50 which should be at least 2-3 times the granularity of msleep. I have no idea about the reason for this strange behavior (a race? but against which thread?), but the effect seems to be solid.
Comment 25 Mark 2012-10-26 00:42:52 UTC
So I applied the patch to my Lenovo X220 FC17 3.6.3 system and, so far, I am no longer experiencing this issue upon un-suspending.
Comment 26 Daniel Vetter 2012-10-29 16:15:35 UTC
Created attachment 69241 [details] [review]
hold rc6 while enabling rps

Can you please test this patch here?
Comment 27 Mark 2012-10-29 16:20:56 UTC
Sure, Daniel.  Quick question, does this patch go on top of the previous patch or instead of the previous patch?
Comment 28 Daniel Vetter 2012-10-29 17:27:26 UTC
(In reply to comment #27)
> Sure, Daniel.  Quick question, does this patch go on top of the previous
> patch or instead of the previous patch?

Instead of the hack; maybe this is the real issue here.
Comment 29 Mark 2012-10-29 19:57:49 UTC
This patch did not work for me.  As soon as I came out of suspension, the fan started running higher and higher and the temperature kept climbing.

By the way, the "hack" didn't work completely either. Even though I've suspended and come out of suspension successfully several times since I applied the "hack" on 10/26, today (10/29) when I unsuspended, I experienced the high temperature problem.
Comment 30 Christian Speckner 2012-10-29 20:35:15 UTC
Daniel, I just tested the patch, unfortunately, it does not help. I took the liberty of attaching the diff of the msleep hack I am currently using which makes the issue rare enough for me to live with it.
Comment 31 Christian Speckner 2012-10-29 20:35:51 UTC
Created attachment 69253 [details]
msleep hack
Comment 32 Mark 2012-10-30 00:14:04 UTC
The patch ("hack") that I'm using I got here:  https://bugzilla.kernel.org/show_bug.cgi?id=48721  ... it's similar to Christian's.
Comment 33 Daniel Vetter 2012-10-30 08:22:29 UTC
Can you please test this patch:

https://patchwork.kernel.org/patch/1652801/
Comment 34 Mark 2012-10-30 11:44:04 UTC
Created attachment 69299 [details]
Rej file
Comment 35 Mark 2012-10-30 11:45:59 UTC
When I go to compile/apply the patch, I get this:

+ ApplyPatch 2-3-drm-i915-put-ring-frequency-and-turbo-setup-into-a-work-queue-v2.patch
+ local patch=2-3-drm-i915-put-ring-frequency-and-turbo-setup-into-a-work-queue-v2.patch
+ shift
+ '[' '!' -f /home/mswallow/rpmbuild/SOURCES/2-3-drm-i915-put-ring-frequency-and-turbo-setup-into-a-work-queue-v2.patch ']'
Patch9992: 2-3-drm-i915-put-ring-frequency-and-turbo-setup-into-a-work-queue-v2.patch
+ case "$patch" in
+ patch -p1 -F1 -s
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_dma.c.rej
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_drv.c.rej
2 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_drv.h.rej
1 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/intel_pm.c.rej
error: Bad exit status from /var/tmp/rpm-tmp.reuLT2 (%prep)


RPM build errors:
    Bad exit status from /var/tmp/rpm-tmp.reuLT2 (%prep)
Comment 36 Rodrigo Vivi 2012-10-30 16:54:11 UTC
Jesse's patch fixed the issue for me.
Comment 37 Christian Speckner 2012-10-30 21:18:20 UTC
Also for me, the patch does not apply against 3.6.2. What kernel version should it be applied against?
Comment 38 Chris Wilson 2012-10-31 09:53:50 UTC
(In reply to comment #37)
> Also for me, the patch does not apply against 3.6.2. What kernel version
> should it be applied against?

Try http://cgit.freedesktop.org/~danvet/drm-intel #drm-intel-next

However, it is a gross hack similar to the msleep() one that does nothing to explain the bug or guarantee that the w/a is sufficient. The essence would again be that we need to allow some time to elapse between programming some GT state and enabling RPS.
Comment 39 Lukas Hejtmanek 2012-11-02 09:30:04 UTC
(In reply to comment #31)
> Created attachment 69253 [details]
> msleep hack

does not work for me :(
Comment 40 Daniel Vetter 2012-11-09 20:34:46 UTC
Call for Testers: Can everyone please test the drm-intel-next-queued branch from

http://cgit.freedesktop.org/~danvet/drm-intel

specifically

commit a18033217df7d7ed7beca1e68b708e5bc6209a1c
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Fri Nov 2 11:14:00 2012 -0700

    drm/i915: put ring frequency and turbo setup into a work queue v5

Note that the drm-intel-next branch does not yet (as of this moment at least) have that patch. If this patch works for everyone, I'll submit it to the stable kernels - the msleep hack seems to be not good enough to some people, and the "moving things around" hack isn't too great either.
Comment 41 Wolfgang Gradl 2012-11-14 12:19:59 UTC
(In reply to comment #40)
> Call for Testers: Can everyone please test the
> drm-intel-next-queued branch from
> 
> http://cgit.freedesktop.org/~danvet/drm-intel
> 
> specifically
> 
> commit a18033217df7d7ed7beca1e68b708e5bc6209a1c

drm-intel-next as of Nov 13 seems to fix the issue for me (Thinkpad X220, 
with a core i5 with sandybridge integrated graphics, running Fedora 16).

Please let me know if you need more information / additional testing.

Thanks.
Comment 42 Mark 2012-11-20 19:33:29 UTC
I would really like to try this fix, but the patch does not apply to my 3.6.6 kernel.  How do I go about trying this?
Comment 43 Ingemar Ådahl 2012-11-22 16:20:10 UTC
I've been running 3.7-rc6, with the changes introduced in drm-intel-next merged in for a while now, and (In reply to comment #40)
> Call for Testers: Can everyone please test the
> drm-intel-next-queued branch from
> 
> http://cgit.freedesktop.org/~danvet/drm-intel
> 
> specifically
> 
> commit a18033217df7d7ed7beca1e68b708e5bc6209a1c
> Author: Jesse Barnes <jbarnes@virtuousgeek.org>
> Date:   Fri Nov 2 11:14:00 2012 -0700
> 
>     drm/i915: put ring frequency and turbo setup into a work queue v5
> 
> Note that the drm-intel-next branch does not yet (as of this moment at
> least) have that patch. If this patch works for everyone, I'll submit it to
> the stable kernels - the msleep hack seems to be not good enough to some
> people, and the "moving things around" hack isn't too great either.
Comment 44 Ingemar Ådahl 2012-11-22 16:31:31 UTC
*** Bug 55871 has been marked as a duplicate of this bug. ***
Comment 45 Ingemar Ådahl 2012-11-22 17:04:57 UTC
(In reply to comment #43)
> I've been running 3.7-rc6, with the changes introduced in drm-intel-next
> merged in for a while now, and (In reply to comment #40)
> > Call for Testers: Can everyone please test the
> > ...
Sorry about that, had some sort of 'mid-air collision' with myself it seems.. What I was trying to say was that running 3.7-rc6 with drm-intel-next (which as Wolfgang said includes Jesse's commit) merged in seems to resolve the issue for me as well, just wanted to let you know..
Comment 46 Daniel Vetter 2012-11-22 20:28:28 UTC
Thanks everyone for testing, I'll tentatively close this for now. Jesse is still working on a real fix instead of duct-tape, so we might ask for a bit of testing still.
Comment 47 Mark 2012-11-23 14:58:21 UTC
So when will we see this?  Kernel update 2.??
Comment 48 Daniel Vetter 2012-11-23 17:18:39 UTC
Going into 3.8 atm. It requires quite some code rework, so unlikely to get backported. The real fix Jesse is working on though might (since that will be less invasive).
Comment 49 Christian Speckner 2012-11-25 22:53:20 UTC
Sorry for not reporting back for so long, I was busy with real life. For the record: 3.7.0-rc6 from drm-intel-next positively fixes the issue for me. Thanks a lot to everybody involved in fixing ;)
Comment 50 Florian Mickler 2012-12-22 09:24:10 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc1:

commit 1a01ab3b2dc4394c46c4c3230805748f632f6f74
Author: Jesse Barnes <jbarnes@virtuousgeek.org>
Date:   Fri Nov 2 11:14:00 2012 -0700

    drm/i915: put ring frequency and turbo setup into a work queue v5
Comment 51 Thomas Kahle 2013-01-25 09:17:23 UTC
(In reply to comment #50)
> A patch referencing this bug report has been merged in Linux v3.8-rc1:
> 
> commit 1a01ab3b2dc4394c46c4c3230805748f632f6f74
> Author: Jesse Barnes <jbarnes@virtuousgeek.org>
> Date:   Fri Nov 2 11:14:00 2012 -0700
> 
>     drm/i915: put ring frequency and turbo setup into a work queue v5

I recently tried 3.8-rc3 and I'm still seeing the exact same issue described in comments 12-14.  The likelihood of it happening is lower now, but it happens in maybe 10% of resumes.
Comment 52 Jani Nikula 2013-01-25 10:43:32 UTC
(In reply to comment #51)
> I recently tried 3.8-rc3 and I'm still seeing the exact same issue described
> in comments 12-14.  The likelihood of it happening is lower now, but it
> happens in maybe 10% of resumes.

I wonder if that could be https://bugzilla.kernel.org/show_bug.cgi?id=52411

fixed by
commit b514407547890686572606c9dfa4b7f832db9958
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Thu Jan 17 10:24:09 2013 +0200

    drm/i915: fix FORCEWAKE posting reads

in drm-intel-fixes and queued for stable.
Comment 53 Thomas Kahle 2013-01-25 12:48:29 UTC
(In reply to comment #52)
> (In reply to comment #51)
> > I recently tried 3.8-rc3 and I'm still seeing the exact same issue described
> > in comments 12-14.  The likelihood of it happening is lower now, but it
> > happens in maybe 10% of resumes.
> 
> I wonder if that could be https://bugzilla.kernel.org/show_bug.cgi?id=52411
> 
> fixed by
> commit b514407547890686572606c9dfa4b7f832db9958
> Author: Jani Nikula <jani.nikula@intel.com>
> Date:   Thu Jan 17 10:24:09 2013 +0200
> 
>     drm/i915: fix FORCEWAKE posting reads
> 
> in drm-intel-fixes and queued for stable.

Quite possible.  I'm testing your patch now and could not reproduce the issue so far.
Comment 54 Thomas Kahle 2013-01-25 15:58:01 UTC
(In reply to comment #53)
> Quite possible.  I'm testing your patch now and could not reproduce the
> issue so far.

I reproduced the issue with 3.8-rc4 and the patch from https://bugzilla.kernel.org/show_bug.cgi?id=52411 
:(
Comment 55 Rodrigo Vivi 2013-02-16 03:46:49 UTC
Created attachment 74922 [details] [review]
Fix-SNB-RC6-init-sequence
Comment 56 Rodrigo Vivi 2013-02-16 03:48:43 UTC
Can you reproduce it at any time?

I think I reproduced it here, but when I started measuring with a power meter I couldn't verify it anymore.

If you are able to reproduce it at any time, could you please test the patch that I'm attaching here? This patch changes the current init sequence to one that I have documented here, line by line, value by value. In the end it is not as efficient as the current one (around 0.6W more), but I think it might be useful to check whether it kills the current bugs related to RC6 on SNB, just in case...

thanks
Comment 57 Andreas Kloeckner 2013-04-03 17:51:45 UTC
FWIW, this is not fixed with

Linux ding 3.8-trunk-amd64 #1 SMP Debian 3.8.3-1~experimental.1 x86_64 GNU/Linux
xserver-xorg-video-intel 2.20.14-1 (from Debian)

It seems somewhat specific to gnome. I had been using KDE for a while, and I don't recall this being a problem.

Andreas
Comment 58 Jean-François Fortin Tam 2013-04-19 21:59:09 UTC
Rodrigo, I'm not knowledgeable enough to actually debug this or try kernel/driver patches, but if you want a more detailed account of the situation, see my latest comment in https://bugzilla.redhat.com/show_bug.cgi?id=866212#c77 - basically, just grab a sandybridge laptop like the Thinkpad X220 and suspend/resume repeatedly until you see the GPU being stuck in "powered on" state in powertop.
Comment 59 João Gomes 2013-04-20 11:16:10 UTC
I tried kernel 3.8.8 and it seems that the problem is still present.
I also noticed that if I turn the laptop on and take some time to log in, it will enter the same condition, with the GPU not going to rc6.
Comment 60 Konstantin Khlebnikov 2013-05-15 03:22:00 UTC
Bug is still here (I'm using 3.9) and it's really annoying.
My laptop (ordinary x220) becomes really hot each time when this happens.
It heats up to 80°C even when idle on battery power. It sucks.

If you have some patches / need some information let me know. I have serious experience in linux kernel development so I can help if you show me where this logic is placed in your driver.
Comment 61 Mariusz Libera 2013-06-09 09:16:16 UTC
Same for me since kernel 3.9. Never had this problem before. I have i3-2310M. After resume powertop shows that GPU is always Powered On. Only workaround seems to be suspend, wait for laptop to cool down, resume, suspend again while it's still cool, resume and it's back to normal.
Comment 62 Chris Wilson 2013-06-12 09:31:29 UTC
Can you please try with this patch: https://patchwork.kernel.org/patch/2707341/ as it claims to fix some instability with rc6 on SandyBridge?
Comment 63 Tomas Janousek 2013-06-13 09:29:13 UTC
ThinkPad T420 + v3.9.5 + https://patchwork.kernel.org/patch/2707341/ ⇒ no rc6 at all, ever. :-(
Comment 64 Konstantin Khlebnikov 2013-06-15 14:39:43 UTC
I have checked all patches which I found here. Without any success.

The bug happens not only after a suspend-resume cycle; sometimes it just happens without any particular activity, for example when I'm reading emails. Suspend-resume is just the fastest way to reproduce this; on my x220 it takes just a couple of suspend-resume cycles. I have tried disabling turbo mode, and even setting GEN6_RP_CONTROL to zero, so it's something deeper. Also, I cannot find any lightweight magic sequence for returning the GPU to a sane state that could be used in a watchdog.

When the bug happens, the GPU generates a bunch of GEN6_PM_RP_UP_EI_EXPIRED | GEN6_PM_RP_UP_THRESHOLD events. It seems like some of its internal state is broken and the GPU thinks that it's extremely busy. Is there any way to examine or reset this state?
Comment 65 Konstantin Khlebnikov 2013-06-24 05:48:07 UTC
OK, it seems like this happens only after waking up from s2ram. Sometimes it's hard to notice because the system heats up gradually and the fan does not start immediately.
Comment 66 Chris Wilson 2013-07-10 19:23:44 UTC
*** Bug 66787 has been marked as a duplicate of this bug. ***
Comment 67 Samuel Sieb 2013-07-10 20:20:34 UTC
This seems to be related to Bug 48721.

Further to my comment 64 on that bug, I've found an even stranger pattern.  When I directly invoke the pm programs (pm-suspend-hybrid or pm-suspend), it's almost guaranteed to trigger this on resume, but using the suspend key through whatever route that takes usually doesn't trigger it.  Once triggered, retrying the pm programs has never fixed it.  However, using the suspend key usually fixes it on the first try.
Comment 68 Samuel Sieb 2013-07-10 20:24:10 UTC
Sorry, I appear to have crossed bugzilla instances.  The bug I was referring to is actually https://bugzilla.kernel.org/show_bug.cgi?id=48721 and was mentioned earlier here in comment 32.
Comment 69 Konstantin Khlebnikov 2013-07-14 13:40:21 UTC
I've found it!

Straightforward git bisect tells me that e6b0b6a82f9c93fe3dd060ae54719456474a74a3 is the first bad commit. But it's a merge commit which merges v3.5-rc7 into drm-next. None of its parents has this problem. So I've rebased all these patches from drm-next to v3.5-rc7 and bisected among them.

And the winner is b4ae3f22d238617ca11610b29fde16cf8c0bc6e0 (drm/i915: load boot context at driver init time) which refers to https://bugs.freedesktop.org/show_bug.cgi?id=50237 Seems like there is no public documentation about this stuff so I have no idea what the hell is happening here.

I've reverted that commit from 3.9.9 and it works like a charm. After that, my automated script is unable to reproduce this problem.
Comment 70 Chris Wilson 2013-07-14 13:48:03 UTC
Can you please try this:

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 7756668..12a9d9c 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -4551,6 +4551,7 @@ static void gen6_init_clock_gating(struct drm_device *dev)
 {
        struct drm_i915_private *dev_priv = dev->dev_private;
        uint32_t dspclk_gate = ILK_VRHUNIT_CLOCK_GATE_DISABLE;
+       u32 tmp;
 
        I915_WRITE(ILK_DSPCLK_GATE_D, dspclk_gate);
 
@@ -4622,8 +4623,11 @@ static void gen6_init_clock_gating(struct drm_device *dev)
                   ILK_DPFDUNIT_CLOCK_GATE_ENABLE);
 
        /* WaMbcDriverBootEnable:snb */
-       I915_WRITE(GEN6_MBCTL, I915_READ(GEN6_MBCTL) |
-                  GEN6_MBCTL_ENABLE_BOOT_FETCH);
+       tmp = I915_READ(GEN6_MBCTL);
+       I915_WRITE(GEN6_MBCTL, tmp | GEN6_MBCTL_ENABLE_BOOT_FETCH);
+       POSTING_READ(GEN6_MBCTL);
+       usleep(100);
+       I915_WRITE(GEN6_MBCTL, tmp & ~GEN6_MBCTL_ENABLE_BOOT_FETCH);
 
        g4x_disable_trickle_feed(dev);
Comment 71 Konstantin Khlebnikov 2013-07-14 13:59:48 UTC
No, it does not help.

(there is no usleep(), I've replaced it with udelay(100))
Comment 72 Konstantin Khlebnikov 2013-07-14 14:32:40 UTC
Seems like this one helps, but I need more time to be sure.

--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -3607,6 +3607,8 @@ static void gen6_init_clock_gating(struct drm_device *dev)
        int pipe;
        uint32_t dspclk_gate = ILK_VRHUNIT_CLOCK_GATE_DISABLE;
 
+       gen6_gt_force_wake_get(dev_priv);
+
        I915_WRITE(ILK_DSPCLK_GATE_D, dspclk_gate);
 
        I915_WRITE(ILK_DISPLAY_CHICKEN2,
@@ -3695,6 +3697,8 @@ static void gen6_init_clock_gating(struct drm_device *dev)
        cpt_init_clock_gating(dev);
 
        gen6_check_mch_setup(dev);
+
+       gen6_gt_force_wake_put(dev_priv);
 }
Comment 73 Konstantin Khlebnikov 2013-07-14 16:35:34 UTC
Seems like it works, at least for me. I've sent the patch.
Comment 74 Chris Wilson 2013-07-17 11:25:18 UTC
commit 7dcd2677ea912573d9ed4bcd629b0023b2d11505
Author: Konstantin Khlebnikov <khlebnikov@openvz.org>
Date:   Wed Jul 17 10:22:58 2013 +0400

    drm/i915: fix long-standing SNB regression in power consumption after resume
Comment 75 Arkadiusz Miskiewicz 2014-09-15 18:06:18 UTC
Is it possible that this issue came back? I'm using 3.16 and 3.17git kernels. After resume from RAM, power usage grows by about 10W.

powertop shows that "CPU core" power usage rises a lot (GPU is part of this right?) but not sure if this can be trusted.

Idle states are
                    | Powered On  3,3%    |
                    | RC6        96,7%    |

so rc6 is being used mostly. Even after resume. 

Happens after every suspend & resume cycle. This is Dell XPS 15 9530.
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06) (prog-if 00 [VGA controller])

Example, before suspend was ~26W, now:

The battery reports a discharge rate of 39.1 W
System baseline power is estimated at 35.8 W

Power est.    Usage     Device name
  16.5 W     11,0%        CPU core
  6.98 W     60,0%        Display backlight
  5.75 W     11,0%        CPU misc
  2.24 W     49,9 pkts/s  Network interface: wlan0 (iwlwifi)
  2.01 W     78,2 ops/s   GPU misc
  1.61 W     11,0%        DRAM
  620 mW    100,0%        USB device: USB Receiver (Logitech)
  106 mW     78,2 ops/s   GPU core
    0 mW      0,1 pkts/s  nic:tap0
    0 mW    100,0%        Radio device: iwlwifi
    0 mW    100,0%        USB device: xHCI Host Controller


How can I track this down further?
Comment 76 Arkadiusz Miskiewicz 2014-09-15 18:10:45 UTC
OK, I see that SNB and Haswell use different init functions for clock gating... so it's not that the "issue is back"; Haswell just seems to be affected in a similar way. A new bug report probably makes sense.
