72924 – GPU overheating after running a full screen GL app on Sandy Bridge

Summary: GPU overheating after running a full screen GL app on Sandy Bridge

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-12-20 16:47 UTC by Armin K
Modified:	2017-07-24 22:56 UTC
CC List:	1 user

See Also:	68807
i915 platform:
i915 features:

Attachments
dmesg (58.17 KB, text/plain) 2013-12-21 01:44 UTC, Armin K
Kernel config (79.76 KB, text/plain) 2013-12-21 01:44 UTC, Armin K

Description Armin K 2013-12-20 16:47:29 UTC

Starting recently, whenever I run a full-screen GL app and then stop it after some time (a native Linux game, a Wine D3D game, an OpenGL screensaver on X, or the simple-egl demo on a Wayland compositor), my machine keeps running at a much higher temperature than it should.

It is normal for this laptop to run at > 90 °C when running a game on Linux (~80 °C on MS Windows), since its cooling system is not very good; 100 °C is critical, but it has never reached that. The normal temperature is around 63 °C.

I have discovered that when I leave my machine idle for some time, it starts its screensaver, the Blue Screen of Death screensaver from the xscreensaver package, which is also an OpenGL screensaver, and the machine stays cool and quiet. But when I start using the machine again (terminating the screensaver), it starts running at > 85 °C, even though I did not run anything else. CPU usage is reported to be < 1% on all 4 CPU threads. The same happens with any other OpenGL app, a Wine D3D game, or the weston-simple-egl demo when run full screen on the Weston Wayland compositor.

I do not recall this being an issue with Mesa 9.2 or with some version from before the 10.0 branch point, but it did occur somewhere between 9.2 and the 10.0 branch point, iirc. I'll try to investigate more.

Installed packages:

linux-3.12.5
libdrm-2.4.50
wayland-1.3.0-21-g1521c62
llvm-195929 (svn revision)
Mesa-10.1.0-g2b404a6
weston-1.3.0-272-ga5059eb
xorg-server-1.14.99.904
xf86-video-intel-2.99.906-98-g9289e2c

All built with gcc-4.8.2, binutils-2.24 and glibc-2.18 (custom system, not Gentoo).

If some more info is needed, let me know.

Comment 1 Kenneth Graunke 2013-12-21 01:00:05 UTC

Most likely this is a kernel/power management problem, not a Mesa problem.  I've seen reports that the Sandybridge GPU sometimes gets stuck at the maximum clock frequency.  You can check that via cat /sys/class/drm/card0/gt_{min,cur,max}_freq_mhz.

It's highly unlikely this is a Mesa issue.

Comment 2 Armin K 2013-12-21 01:43:19 UTC

Thanks for the response. It is indeed a kernel issue, as you said.

Before starting xonotic-glx:

$ cat /sys/class/drm/card0/gt_{min,cur,max}_freq_mhz
650
650
1200

After quitting xonotic-glx after playing it for a minute or two:

$ cat /sys/class/drm/card0/gt_{min,cur,max}_freq_mhz
650
1200
1200

And it won't go down. It is possible that I upgraded to kernel 3.12 some time before the Mesa 10.0 branch point, and that is why I was unable to reproduce the issue. I seem to have all kinds of weird power management problems, even with radeon, not just intel (hybrid GPUs).
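
The manual check above can also be scripted. The following is a minimal Python sketch, not part of the original report: it reads the same three sysfs files and flags the case where the current frequency stays pinned at the maximum across several samples. The function names, the sample count, and the interval are hypothetical choices made for illustration.

```python
import time
from pathlib import Path


def read_gt_freqs(card_dir="/sys/class/drm/card0"):
    """Return (min, cur, max) GPU frequency in MHz from the i915 sysfs files."""
    base = Path(card_dir)
    return tuple(
        int((base / f"gt_{name}_freq_mhz").read_text().strip())
        for name in ("min", "cur", "max")
    )


def stuck_at_max(samples):
    """True if every (min, cur, max) sample shows cur pinned at max.

    An empty sample list, or a card whose min equals its max, is not
    treated as stuck.
    """
    samples = list(samples)
    return bool(samples) and all(
        cur == mx and mx > mn for mn, cur, mx in samples
    )


def sample_freqs(card_dir="/sys/class/drm/card0", count=5, interval_s=2.0):
    """Collect `count` readings, sleeping `interval_s` seconds between them.

    The defaults are arbitrary illustrative values, not recommendations.
    """
    samples = []
    for i in range(count):
        samples.append(read_gt_freqs(card_dir))
        if i + 1 < count:
            time.sleep(interval_s)
    return samples
```

Calling `stuck_at_max(sample_freqs())` on an idle desktop after quitting the GL app would report the condition described here; for the second reading above (650, 1200, 1200), `stuck_at_max` returns `True`.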

Kernel config and dmesg output attached.

Comment 3 Armin K 2013-12-21 01:44:08 UTC

Created attachment 91065
dmesg

Comment 4 Armin K 2013-12-21 01:44:37 UTC

Created attachment 91066
Kernel config

Comment 5 Chris Wilson 2013-12-22 10:03:23 UTC

Try a 3.13 kernel for a different algorithm for tuning RPS frequencies. The original one will maintain a frequency (even the highest) if there is any OpenGL activity. However, this should not prevent the GPU from entering rc6 (sleep mode) unless there is very frequent GPU activity.
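
Whether the GPU actually reaches rc6 can be estimated separately from its clock frequency. The Python sketch below is an illustration added here, not something posted in this thread. It assumes the i915 driver exposes a cumulative rc6 counter at power/rc6_residency_ms under the card's sysfs directory; that path is an assumption and should be checked on the kernel in use. The function names and the measurement interval are hypothetical.

```python
import time
from pathlib import Path


def read_rc6_residency_ms(card_dir="/sys/class/drm/card0"):
    """Read the cumulative rc6 residency counter in milliseconds.

    Assumption: the counter lives at <card_dir>/power/rc6_residency_ms.
    """
    path = Path(card_dir, "power", "rc6_residency_ms")
    return int(path.read_text().strip())


def rc6_fraction(before_ms, after_ms, elapsed_s):
    """Fraction of the elapsed wall-clock time spent in rc6, clamped to [0, 1]."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    fraction = (after_ms - before_ms) / (elapsed_s * 1000.0)
    return min(1.0, max(0.0, fraction))


def measure_rc6(card_dir="/sys/class/drm/card0", interval_s=5.0):
    """Sample the counter twice, interval_s seconds apart, and return the rc6 fraction."""
    start = time.monotonic()
    before = read_rc6_residency_ms(card_dir)
    time.sleep(interval_s)
    after = read_rc6_residency_ms(card_dir)
    return rc6_fraction(before, after, time.monotonic() - start)
```

A value near 1.0 on an idle desktop would suggest rc6 is being entered as expected, while a value near 0.0 would point to frequent GPU activity or an rc6 problem rather than to RPS alone.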

Comment 6 Armin K 2013-12-22 19:35:08 UTC

I am running the xfwm4 window manager, which I believe has render-based compositing enabled, but everything worked fine before the 3.12 upgrade. I have i915.i915_enable_rc6=1 on the kernel command line; it has been there for a long time and I never noticed anything like this. I'll try a 3.13 kernel shortly, but I believe it might be worth investigating what broke this.

Comment 7 Chris Wilson 2013-12-22 19:41:31 UTC

iirc, what changed in 3.12 was that the rc6 activation period was increased, and we have had a number of reports that that has reduced rc6 efficacy on many machines.

Comment 8 Armin K 2013-12-22 20:59:53 UTC

From what I have noticed, that period is more than 15 minutes here, and I usually just reboot since I can't stand the fans running so loud for so long.

Do you have a link to that commit maybe? I want to try to revert it and see if it will fix my problem.

Comment 9 Armin K 2013-12-24 19:24:09 UTC

Seems to be working fine with kernel 3.13-rc5.

Comment 10 Chris Wilson 2013-12-30 13:13:29 UTC

For the rc6 commits, see:

commit 351aa5666d02062b52329bcfe4bcf9d1f882fba9
Author: Stéphane Marchesin <marcheu@chromium.org>
Date:   Tue Aug 13 11:55:17 2013 -0700

    drm/i915: tune the RC6 threshold for stability
    
    It's basically the same deal as the RC6+ issues on ivy bridge
    except this time with RC6 on sandy bridge. Like last time the
    core of the issue is that the timings don't work 100% with our
    voltage regulator. So from time to time, the kernel will print
    a warning message about the GPU not getting out of RC6. In
    particular, I found this fairly easy to reproduce during
    suspend/resume.
    
    Changing the threshold to 125000 instead of 50000 seems to fix
    the issue. The previous patch used 150000 but as it turns out
    this doesn't work everywhere. After getting such a machine, I
    bisected the highest value which works, which is 125000, so here
    it is.
    
    I also measured the idle power usage before/after this patch and
    didn't see a difference on a sandy bridge laptop. On haswell and
    up, it makes a big difference, so we want to keep it at 50k
    there. It also seems like haswell doesn't have the RC6 issues
    that sandy bridge has so the 50k value is fine.
    
    Signed-off-by: Stéphane Marchesin <marcheu@chromium.org>
    Acked-by: Jesse Barnes <jbarnes@virtuousgeek.org>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>


commit 29c78f609e661e663a239a37923adb1d61f6386c
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sat Nov 16 16:04:26 2013 +0100

    Partially revert "drm/i915: tune the RC6 threshold for stability"
    
    This reverts commit 351aa5666d02062b52329bcfe4bcf9d1f882fba9.
    
    It breaks rc6 on at least one snb machine. Since we don't yet have a
    report for ivb let's keep it there for now.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=71656
    Cc: Stéphane Marchesin <marcheu@chromium.org>
    Cc: erik@vontaene.de
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>


But if 3.13 is cooler, that suggests it was RPS (gpu frequency).

Comment 11 Chris Wilson 2013-12-30 13:52:03 UTC

Probably related to (the same as!) bug 68807.

Comment 12 Chris Wilson 2014-01-08 10:05:21 UTC

If we are happy with the updated RPS tuning, let's leave it at that.
