Created attachment 129247 [details]
syslog as system was overheating and recovering, then gpu crashed.

root@ctillman:/home/chris# grep -i GPU /var/log/syslog
Jan 31 21:12:29 ctillman kernel: [64809.320303] [drm] GPU HANG: ecode 6:0:0x86fafffa, in Xorg [612], reason: Hang on render ring, action: reset
Jan 31 21:12:29 ctillman kernel: [64809.320305] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jan 31 21:12:29 ctillman kernel: [64809.320306] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jan 31 21:12:29 ctillman kernel: [64809.320307] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jan 31 21:12:29 ctillman kernel: [64809.320336] drm/i915: Resetting chip after gpu hang
Jan 31 21:25:30 ctillman kernel: [ 9.156348] RAPL PMU: hw unit of domain pp1-gpu 2^-16 Joules

The error log just said "no error state collected". The user experience was a complete freeze of the computer requiring power off.
The overheating is unlikely to be the cause of the GPU hang or the system crash. However, all the information we need to triage the bug is in the error state, which, as you found out, is not preserved across reboot. Any chance you can grab it before the system lockup/reboot?
Created attachment 129259 [details] attachment-5677-0.html

As I also reported, the computer was completely non-responsive when the crash occurred. No mouse, no response to the Ctrl-Alt-F keys to access virtual consoles, no response to Ctrl-Shift-Backspace. So there was no chance to investigate further.

I can try to check it out if I notice the laptop making a lot of noise and feeling hot to the touch again.

Any chance of making a change so it records the necessary information on disk, where it survives reboots? Perhaps in my case, a script which records something every hour or minute?
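A periodic script of the kind asked about here could look roughly like the sketch below. It is only an illustration, not something the i915 developers provided: the source path is the one named in the kernel message, while `SAVE_DIR`, the function names, and the polling interval are hypothetical choices. It assumes the error file can be read by the user running the script (this may require root), and it cannot help if the machine freezes before the next poll.

```python
import os
import time
from pathlib import Path
from typing import Optional

ERROR_STATE = Path("/sys/class/drm/card0/error")  # path from the kernel log
SAVE_DIR = Path("/var/tmp/gpu-error-states")      # hypothetical destination
EMPTY_MARKER = b"no error state collected"        # text seen when nothing was captured


def read_error_state(source: Path) -> Optional[bytes]:
    """Return the error state contents, or None if nothing was captured."""
    try:
        data = source.read_bytes()
    except OSError:
        return None
    if not data.strip() or EMPTY_MARKER in data[:200].lower():
        return None
    return data


def save_durably(data: bytes, dest_dir: Path) -> Path:
    """Write data to a timestamped file and force it to disk."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / f"i915-error-{time.strftime('%Y%m%d-%H%M%S')}.txt"
    with open(target, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # push the data out before a possible lockup
    return target


def watch(source: Path = ERROR_STATE, dest_dir: Path = SAVE_DIR,
          interval_s: float = 60.0) -> None:
    """Poll forever, saving each new error state once."""
    last = None
    while True:
        data = read_error_state(source)
        if data is not None and data != last:
            save_durably(data, dest_dir)
            last = data
        time.sleep(interval_s)
```

Calling `watch()` from a service started at boot would keep a copy of any captured error state on disk, so it could be attached to the bug after a forced power-off.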
Information provided by the submitter (logs); moving bug to REOPENED state.
(In reply to Ricardo from comment #3)
> Information provided by the submitter (logs); moving bug to REOPENED state.

Good afternoon,
Is this bug still valid? Is the problem still present? If so, could you add logs and HW and SW information? Thank you.
Created attachment 132288 [details] attachment-14853-0.html

The initial attachments (already attached to the bug) are all I have.

I can report that the root cause of the overheating was found: an overheating internal connector in the power supply chain. The connector was replaced, and thermal grease re-applied to the heat sink, so the machine no longer has overheating problems.

The point of this bug was not to troubleshoot my machine; the point was that the available measurements from coretemp were not being heeded. The logs show that an overtemperature is reported only for a cycle until it resets, saying for example

"[57849.613938] CPU1: Core temperature above threshold, cpu clock throttled (total events = 12172)"
and then almost immediately (about 1 millisecond) after,
"[57849.614950] CPU1: Core temperature/speed normal"

It appears from the logs that the only response to monitoring is an immediate reset of the sensor, and that protection of the machine is not occurring.

Chris
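The interval between the two quoted kernel messages can be checked from their timestamps, which are seconds since boot with a microsecond fraction. The sketch below is only an illustration; `kernel_time_us` is a hypothetical helper name, and the two log lines are copied from the comment above.

```python
import re

# Kernel log prefix such as "[57849.613938]" or "[ 9.156348]"
TIMESTAMP = re.compile(r"\[\s*(\d+)\.(\d+)\]")


def kernel_time_us(line: str) -> int:
    """Return the kernel timestamp of a log line in whole microseconds."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError(f"no kernel timestamp in: {line!r}")
    seconds, fraction = match.groups()
    # Pad or truncate the fraction to exactly six digits (microseconds).
    return int(seconds) * 1_000_000 + int(fraction.ljust(6, "0")[:6])


throttled = ("[57849.613938] CPU1: Core temperature above threshold, "
             "cpu clock throttled (total events = 12172)")
normal = "[57849.614950] CPU1: Core temperature/speed normal"

gap_us = kernel_time_us(normal) - kernel_time_us(throttled)
print(f"{gap_us} microseconds")  # prints "1012 microseconds"
```

For these two lines the gap comes to roughly one millisecond.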
(In reply to Chris Tillman from comment #5)
> ... available measurements from coretemp were not being heeded. The
> logs show that an overtemperature is reported only for a cycle until it
> resets, saying for example
>
> "[57849.613938] CPU1: Core temperature above threshold, cpu clock throttled
> (total events = 12172)"
> and then almost immediately (about 1 millisecond) after,
> "[57849.614950] CPU1: Core temperature/speed normal"
>
> It appears from the logs that the only response to monitoring is an
> immediate reset of the sensor, and that protection of the machine is not
> occurring.

Hello Chris,
Although this is in fact a problem, it seems to be more related to the CPU and the coretemp driver than to the GPU and DRM. If that's the case, there is not much we can do here. Could you please take some time to check the community forums for guidance on which product/component could be causing the problem, and change the bug information accordingly? That would help in finding a solution for this case: https://01.org/linuxgraphics/community
Thank you.
Created attachment 132471 [details] attachment-19646-0.html

Well, I think I agree with you, but that's not what the log told me to do:

Jan 31 21:12:29 ctillman kernel: [64809.320303] [drm] GPU HANG: ecode 6:0:0x86fafffa, in Xorg [612], reason: Hang on render ring, action: reset
Jan 31 21:12:29 ctillman kernel: [64809.320305] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jan 31 21:12:29 ctillman kernel: [64809.320306] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jan 31 21:12:29 ctillman kernel: [64809.320306] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jan 31 21:12:29 ctillman kernel: [64809.320306] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jan 31 21:12:29 ctillman kernel: [64809.320307] [drm] GPU crash dump saved to /sys/class/drm/card0/error

I understand that without the dump log it can be difficult to pin down. I tried to find a coretemp bug list but couldn't. I'm happy for you to close it if you can't think of anything else.
Closing bug, as it is not related to DRM.