Bug 99611 - [SNB] GPU hang after over temperature
Summary: [SNB] GPU hang after over temperature
Status: CLOSED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-01-31 08:58 UTC by Chris Tillman
Modified: 2017-07-11 14:18 UTC

See Also:
i915 platform: SNB
i915 features:


Attachments
syslog as system was overheating and recovering, then gpu crashed. (64.92 KB, text/plain)
2017-01-31 08:58 UTC, Chris Tillman
attachment-5677-0.html (3.16 KB, text/html)
2017-01-31 17:38 UTC, Chris Tillman
attachment-14853-0.html (4.79 KB, text/html)
2017-06-27 18:50 UTC, Chris Tillman
attachment-19646-0.html (4.17 KB, text/html)
2017-07-06 07:15 UTC, Chris Tillman

Description Chris Tillman 2017-01-31 08:58:11 UTC
Created attachment 129247 [details]
syslog as system was overheating and recovering, then gpu crashed.

root@ctillman:/home/chris# grep -i GPU /var/log/syslog
Jan 31 21:12:29 ctillman kernel: [64809.320303] [drm] GPU HANG: ecode 6:0:0x86fafffa, in Xorg [612], reason: Hang on render ring, action: reset
Jan 31 21:12:29 ctillman kernel: [64809.320305] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jan 31 21:12:29 ctillman kernel: [64809.320306] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jan 31 21:12:29 ctillman kernel: [64809.320307] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jan 31 21:12:29 ctillman kernel: [64809.320336] drm/i915: Resetting chip after gpu hang
Jan 31 21:25:30 ctillman kernel: [    9.156348] RAPL PMU: hw unit of domain pp1-gpu 2^-16 Joules

The error log just said "no error state collected".

The user experience was a complete freeze of the computer requiring power off.
Comment 1 Chris Wilson 2017-01-31 10:10:07 UTC
The overheating is unlikely to be the cause of the GPU hang or the system crash. However, all the information we need to triage the bug is in the error state, which as you found out is not preserved across reboot. Any chance you can grab it before the system locks up or reboots?
Comment 2 Chris Tillman 2017-01-31 17:38:48 UTC
Created attachment 129259 [details]
attachment-5677-0.html

As I also reported, the computer was completely non-responsive when the
crash occurred: no mouse, no response to the Ctrl-Alt-F keys for accessing
virtual consoles, no response to Ctrl-Shift Backspace. So there was no chance
to investigate further.

I can try to check it out if I notice the laptop making a lot of noise and
feeling hot to the touch again. Any chance of making a change so it records
necessary information on disk which survives reboots? Perhaps in my case, a
script which records something every hour or minute?
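
A rough sketch of the kind of script asked about above, not an existing tool: it polls the error state file named in the kernel log and copies any captured dump to disk so it survives a reboot. The destination directory and the helper names (`read_error_state`, `write_snapshot`, `watch`) are hypothetical, and the check for an empty dump assumes the "no error state collected" text reported in this bug.

```python
import os
import time
from pathlib import Path

# Path taken from the kernel log in this report.
ERROR_STATE = Path("/sys/class/drm/card0/error")
# Hypothetical destination on persistent storage (an assumption).
SAVE_DIR = Path("/var/tmp/gpu-error-states")
# Text the reporter saw when no dump was available.
EMPTY_MARKER = "no error state collected"


def read_error_state(source):
    """Return the error state text, or None if nothing has been captured."""
    try:
        data = source.read_text(encoding="utf-8", errors="replace")
    except OSError:
        return None
    text = data.strip()
    if not text or text.lower().startswith(EMPTY_MARKER):
        return None
    return data


def write_snapshot(data, dest_dir, now=None):
    """Write data to a timestamped file under dest_dir and sync it to disk."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S", time.gmtime(now))
    target = dest_dir / f"i915-error-{stamp}.txt"
    with open(target, "w", encoding="utf-8") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # push the data to disk before a possible freeze
    return target


def watch(interval_s=60):
    """Poll forever, saving each new error state once."""
    last = None
    while True:
        data = read_error_state(ERROR_STATE)
        if data is not None and data != last:
            print("saved", write_snapshot(data, SAVE_DIR))
            last = data
        time.sleep(interval_s)
```

Calling `watch()` would likely need root, since the error file is typically not world-readable. A poller like this only helps if the machine stays alive long enough after the hang for the next poll to run; it cannot capture anything when the whole system freezes at once.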

Comment 3 Ricardo 2017-03-03 17:12:45 UTC
Information (logs) provided by the submitter; moving bug to REOPENED state.
Comment 4 Elizabeth 2017-06-26 21:40:10 UTC
(In reply to Ricardo from comment #3)
> Information (logs) provided by the submitter; moving bug to REOPENED state.

Good afternoon,
Is this bug still valid? Is the problem still present? If so, could you add logs and HW and SW information?
Thank you.
Comment 5 Chris Tillman 2017-06-27 18:50:35 UTC
Created attachment 132288 [details]
attachment-14853-0.html

The initial attachments (already attached to the bug) are all I have. I can
report that the root cause of the overheating was found: an overheating
internal connector in the power supply chain. The connector was replaced
and thermal grease was re-applied to the heat sink, so the machine no longer
has overheating problems.

The point of this bug was not to troubleshoot my machine; the point was
that the available measurements from coretemp were not being heeded. The
logs show that an overtemperature condition is reported only briefly before
it resets, for example

"[57849.613938] CPU1: Core temperature above threshold, cpu clock throttled
(total events = 12172)"
and then almost immediately (about 1 millisecond) after,
"[57849.614950] CPU1: Core temperature/speed normal"

It appears from the logs that the only response to monitoring is an
immediate reset of the sensor, and that protection of the machine is not
occurring.
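
The gap between the two log lines quoted above can be checked by subtracting their timestamps. This is a minimal sketch, assuming the bracketed kernel timestamps are seconds since boot with six decimal places (microsecond resolution); the helper name `kernel_stamp_us` is made up for illustration.

```python
import re

# Matches a kernel log timestamp such as "[57849.613938]".
STAMP = re.compile(r"\[\s*(\d+)\.(\d{6})\]")


def kernel_stamp_us(line):
    """Return the bracketed timestamp of a kernel log line in microseconds."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError("no kernel timestamp found in line")
    seconds, micros = match.groups()
    # Integer arithmetic avoids floating-point rounding on the subtraction.
    return int(seconds) * 1_000_000 + int(micros)


above = ("[57849.613938] CPU1: Core temperature above threshold, "
         "cpu clock throttled (total events = 12172)")
normal = "[57849.614950] CPU1: Core temperature/speed normal"

gap_us = kernel_stamp_us(normal) - kernel_stamp_us(above)
print(gap_us)  # 1012 microseconds, i.e. about 1 ms
```

Under that assumption, the throttle event and the return to normal are about one millisecond apart.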

Chris

Comment 6 Elizabeth 2017-07-05 20:22:54 UTC
(In reply to Chris Tillman from comment #5)
> ... available measurements from coretemp were not being heeded. The
> logs show that an overtemperature is reported only for a cycle until it
> sets back, saying for example
> 
> "[57849.613938] CPU1: Core temperature above threshold, cpu clock throttled
> (total events = 12172)"
> and then almost immediately (about 1 millisecond) after,
> "[57849.614950] CPU1: Core temperature/speed normal"
> 
> It appears from the logs that the only response to monitoring is an
> immediate reset of the sensor, and that protection of the machine is not
> occurring.
> 
Hello Chris,
Although this is in fact a problem, it seems to be more related to the CPU and the coretemp driver than to the GPU and DRM. If that's the case, there is not much we can do here. Could you please take some time to check the community forums (https://01.org/linuxgraphics/community) for guidance on which product/component could be causing the problem, and update the bug information accordingly?
That would help in finding a solution for this case.
Thank you.
Comment 7 Chris Tillman 2017-07-06 07:15:09 UTC
Created attachment 132471 [details]
attachment-19646-0.html

Well, I think I agree with you, but that's not what the log told me to do:

Jan 31 21:12:29 ctillman kernel: [64809.320303] [drm] GPU HANG: ecode 6:0:0x86fafffa, in Xorg [612], reason: Hang on render ring, action: reset
Jan 31 21:12:29 ctillman kernel: [64809.320305] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jan 31 21:12:29 ctillman kernel: [64809.320306] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jan 31 21:12:29 ctillman kernel: [64809.320306] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jan 31 21:12:29 ctillman kernel: [64809.320306] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jan 31 21:12:29 ctillman kernel: [64809.320307] [drm] GPU crash dump saved to /sys/class/drm/card0/error


I understand that without the dump log it can be difficult to pin
down. I tried to find a coretemp bug list but couldn't. I'm happy for you
to close it if you can't think of anything else.


Comment 8 Ricardo 2017-07-11 14:18:51 UTC
Closing bug as it is not related to DRM.

