Bug 89451 - [ivb] system spontaneously reboots on high load
Summary: [ivb] system spontaneously reboots on high load
Status: CLOSED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-03-05 23:01 UTC by rasmus
Modified: 2017-07-24 22:48 UTC

See Also:
i915 platform:
i915 features:


Attachments
output of dmesg captured as journalctl -b-1 after a crash (204.61 KB, text/plain)
2015-03-05 23:01 UTC, rasmus
glxinfo output (18.96 KB, text/plain)
2015-03-05 23:02 UTC, rasmus
intel_reg_dumper output (17.47 KB, text/plain)
2015-03-05 23:02 UTC, rasmus
output of lspci -nn (2.01 KB, text/plain)
2015-03-05 23:02 UTC, rasmus
LIBGL_DEBUG=verbose start_furmark_windowed_1024x640.sh > stdout.txt 2> stderr.txt (832 bytes, text/plain)
2015-03-05 23:03 UTC, rasmus
output of journalctl -b-1 -e _COMM=Xorg.bin after a crash (41.71 KB, text/plain)
2015-03-05 23:04 UTC, rasmus
xorg_crash2.txt output of journalctl -b-1 -e _COMM=Xorg.bin after another crash but with more debug kernel modules. (32.21 KB, text/plain)
2015-03-05 23:04 UTC, rasmus
journalctl -b -e _COMM=Xorg.bin of a system that has not crashed. (37.99 KB, text/plain)
2015-03-05 23:05 UTC, rasmus
netconsole output of crash (17.73 KB, text/plain)
2015-03-06 09:42 UTC, rasmus
output of dmidecode (15.78 KB, text/plain)
2015-03-06 09:58 UTC, rasmus
currently active bios settings (3.25 KB, text/plain)
2015-03-06 10:12 UTC, rasmus
dmidecode output (15.58 KB, text/plain)
2015-03-06 10:15 UTC, rasmus


Description rasmus 2015-03-05 23:01:39 UTC
Created attachment 114045 [details]
output of dmesg captured as journalctl -b-1 after a crash

I have tried to provide the info requested by Intel and recommended by
Fedora[1].  Kudos to the Fedora folks for providing very detailed
instructions!

If some information is missing please let me know.


1 System environment
════════════════════

  – chipset: HD 4000 with Intel i7 Ivy Bridge
  – system architecture: x86_64
  – xf86-video-intel: 2.99.91 (2.99.917)
  – xserver: 1.16.3 (1.17.1)
  – mesa: 10.4.3 (10.4.5)
  – libdrm: 2.4.59 (2.4.59)
  – kernel version: 3.18.7-200.fc21.x86_64 (3.18.6-1-ARCH)
  – Linux distribution: Fedora 21 (Archlinux)
  – Machine or mobo model: Thinkpad W530
  – Display connector: laptop screen

  I conduct the test below in Fedora.  I normally use Arch.  The problem
  is present in both distros.


2 Reproduce steps. Probability if not 100% reproducible
═══════════════════════════════════════════════════════

  1. Download gputest from http://www.geeks3d.com/gputest/
  2. Run start_furmark_windowed_1024x640.sh.  Put it in full screen if
     you like.
  3. On my system the computer crashes and reboots, typically within 10
     minutes.  The reboot looks as if power was cut and restored, as if
     the CPU were overheating (which it is not).  In other words, the
     systemctl shutdown logs are not displayed.


3 Additional info
═════════════════

  I experience crashes and reboots on my Thinkpad W530 with HD4000
  whenever the iGPU is exposed to moderate load, e.g. playing a simple
  video game (Shadowrun and Mark of the Ninja are two examples).  This
  happens on my main distro, Arch, and on Fedora.  It only happens when
  I use the Intel iGPU, and it happens irrespective of whether Nvidia
  Optimus is enabled.

  Note: This is seemingly *not* a hardware issue!  I can run mprime cpu
  load indefinitely on the system without a crash.  The temperature
  never goes above 90 degrees when I run any test (and I had the fan
  replaced within the last three months).  My brother who also has a
  Thinkpad W530 experiences the same issue on Debian.

  Importantly: the computer is completely stable on Windows 7, where I
  have stress tested the system with the same procedure as below for 6-8
  hours (latest Intel drivers, Nvidia Optimus disabled).


  I have tested on Fedora 21 and Archlinux.  My brother, who also has a
  W530, has tested on Debian Sid.

  The tests are conducted on a clean Fedora 21 image ‘cause it’s better
  for debugging.


4 Attachments
═════════════

  • dmesg_crash.txt output of dmesg captured as journalctl -b-1 after a
    crash.
  • glxinfo output
  • intel_reg_dumper output
  • output of lspci -nn
  • stdout.txt and stderr.txt: from running
    ┌────
    │ LIBGL_DEBUG=verbose start_furmark_windowed_1024x640.sh > stdout.txt 2> stderr.txt
    └────
  • xorg_crash.txt output of journalctl -b-1 -e _COMM=Xorg.bin after one
    crash
  • xorg_crash2.txt output of journalctl -b-1 -e _COMM=Xorg.bin after
    another crash but with more debug kernel modules.
  • xorg_fine.txt journalctl -b -e _COMM=Xorg.bin when a crash has not
    occurred (I didn’t turn off the system).
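
  The attachments listed above could be gathered in one pass with a
  small script.  What follows is only a sketch: the `collect` helper,
  the `OUT` directory and the output file names are illustrative
  assumptions and not part of the original report, while the commands
  are the ones named in the list.  intel_reg_dumper typically needs
  root.

```shell
#!/bin/sh
# Illustrative helper (not from the original report): save the output of each
# diagnostic command into its own file under $OUT.
OUT="${OUT:-./bugreport}"

# collect FILE CMD [ARGS...]: run CMD and store stdout+stderr in $OUT/FILE.
# A missing command is recorded in the file instead of aborting the run.
collect() {
    name=$1
    shift
    mkdir -p "$OUT"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" > "$OUT/$name" 2>&1
    else
        echo "skipped: $1 not found" > "$OUT/$name"
    fi
}

# Run after the reboot that follows a crash, so that -b-1 is the crashed boot.
collect_all() {
    collect dmesg_crash.txt journalctl -b-1
    collect xorg_crash.txt journalctl -b-1 -e _COMM=Xorg.bin
    collect glxinfo.txt glxinfo
    collect intel_reg_dumper.txt intel_reg_dumper
    collect lspci.txt lspci -nn
}
```

  Calling `collect_all` after the post-crash boot would leave one file
  per command in `$OUT`, ready to attach.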


5 “Missing” attachments
═══════════════════════

  • Xorg.0.log: for some reason not present.
  • xorg.conf: default setup.
  • the last batch buffer: it shows nothing interesting before the
    reboot.  Is there a way to record it persistently?
  • I’m trying to get a “kdump” of the kernel when it crashes as
    described here[2]



Footnotes
─────────

[1] https://01.org/linuxgraphics/documentation/how-report-bugs
    http://fedoraproject.org/wiki/How_to_debug_Xorg_problems

[2] http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
Comment 1 rasmus 2015-03-05 23:02:08 UTC
Created attachment 114046 [details]
glxinfo output
Comment 2 rasmus 2015-03-05 23:02:27 UTC
Created attachment 114047 [details]
intel_reg_dumper output
Comment 3 rasmus 2015-03-05 23:02:42 UTC
Created attachment 114048 [details]
output of lspci -nn
Comment 4 rasmus 2015-03-05 23:03:07 UTC
Created attachment 114049 [details]
LIBGL_DEBUG=verbose start_furmark_windowed_1024x640.sh > stdout.txt 2> stderr.txt
Comment 5 rasmus 2015-03-05 23:04:27 UTC
Created attachment 114050 [details]
output of journalctl -b-1 -e _COMM=Xorg.bin after a crash
Comment 6 rasmus 2015-03-05 23:04:51 UTC
Created attachment 114051 [details]
xorg_crash2.txt output of journalctl -b-1 -e _COMM=Xorg.bin after another crash but with more debug kernel modules.
Comment 7 rasmus 2015-03-05 23:05:35 UTC
Created attachment 114052 [details]
journalctl -b -e _COMM=Xorg.bin of a system that has not crashed.
Comment 8 rasmus 2015-03-05 23:11:58 UTC
If it is of any relevance, I have discussed the issue in this thread on the interwebs: http://forum.thinkpads.com/viewtopic.php?f=70&t=116472.
Comment 9 rasmus 2015-03-05 23:16:34 UTC
I tried to get a kdump following the Fedora wiki instructions, but nothing is saved to /var/dumps.  Sorry.
Comment 10 Chris Wilson 2015-03-06 08:05:33 UTC
System reboot is a processor event. A GPU failure just kills the system - I have yet to hear of one that could cause a spontaneous reboot.

I would suggest you try setting up netconsole.
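
For readers unfamiliar with netconsole: it is a kernel module that sends kernel log messages over UDP to a second machine, so the last messages before a hard reboot are not lost along with the local log. A minimal sketch follows. All addresses, ports, the interface name, the MAC address and the `netconsole_param` helper are placeholders, not values from this report; the parameter layout is the one given in the kernel's netconsole documentation.

```shell
#!/bin/sh
# Build the value of the netconsole= module parameter, whose layout is
#   src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# Arguments, in order: src-port src-ip dev tgt-port tgt-ip tgt-mac
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder values: replace with the real addresses of both machines.
PARAM=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55)
echo "$PARAM"

# On the crashing machine, as root:
#   dmesg -n 8                                 # log every message level to the console
#   modprobe netconsole netconsole="$PARAM"
# On the receiving machine, listen for UDP on the target port (6666 here),
# for example with netcat; the exact flags differ between netcat variants.
```

A wired connection is assumed here, since the sending interface has to keep working up to the moment of the crash.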
Comment 11 rasmus 2015-03-06 09:42:59 UTC
Created attachment 114076 [details]
netconsole output of crash
Comment 12 rasmus 2015-03-06 09:50:18 UTC
Hi Chris

> System reboot is a processor event. A GPU failure just kills the system - I have yet to hear of one that could cause a spontaneous reboot.

I don't understand most of what you are saying above.  I'm merely a *user* of software and hardware.

I can record a video of the screen if that helps.

> I would suggest you try setting up netconsole.

I have attached the requested output now.
Comment 13 Chris Wilson 2015-03-06 09:50:54 UTC
Looks very, very suspicious. The reboot is not at the OS level, so down to firmware. Look at your BIOS settings and version.
Comment 14 rasmus 2015-03-06 09:58:17 UTC
Created attachment 114077 [details]
output of dmidecode
Comment 15 rasmus 2015-03-06 09:59:31 UTC
Oops, I didn't delete the serial number (warranty) from the dmidecode file.  Could you delete it?
Comment 16 rasmus 2015-03-06 10:12:52 UTC
Created attachment 114079 [details]
currently active bios settings
Comment 17 rasmus 2015-03-06 10:14:09 UTC
I attached my current BIOS settings.  I skipped a couple of sections of the BIOS, but hopefully the info you need is there.
Comment 18 rasmus 2015-03-06 10:15:37 UTC
Created attachment 114080 [details]
dmidecode output
Comment 19 rasmus 2015-03-06 10:57:24 UTC
Chris,

> The reboot is not at the OS level, so down to firmware. 

Just so I know how to proceed.  How should I interpret the above statement? Should I try to get in touch with Lenovo engineers? (I don't know that they have got any open channels).

Again, there's no issue on W7, which is why I suspected the Linux drivers.
Comment 20 rasmus 2015-03-07 17:48:48 UTC
I do not think this is a firmware bug.  Rather, I think it's a bug in Linux or Xorg+friends.

I have run the test specified in "2 Reproduce steps" with Fedora 20 and Fedora 19 (ISO, no updates).  In Fedora 20 the problem is present.  In Fedora 19 I ran the test for approximately 3 hours without any reboots.  Of course that does not mean that the bug isn't present, but in Fedora {20, 21} and Arch the reboot usually occurs within 10 minutes.
Comment 21 Chris Wilson 2015-03-07 21:15:22 UTC
Your dmesg does not show a controlled shutdown. A GPU hang, even a low-level hardware hang, should not result in the machine rebooting. Your dmesg does show that the kernel disagrees with the ACPI firmware implementation and that it is actively managing the thermal throttling. At this point, your best bet is to bisect the kernel and see where that leads.
Comment 22 rasmus 2015-03-08 02:03:41 UTC
So to understand: your claim is that there's no bug in the Intel drivers, but there's a bug in Linux?

By now Fedora 19 (iso version) has been running for 8 hours.  So almost certainly something was introduced after Fedora 19 that causes the reboots.
Comment 23 rasmus 2015-03-08 02:23:12 UTC
Also, how would I bisect the kernel in this case?  The error involves a pretty big crash.  I would appreciate hints on how to write a bisect program that would involve (a) potential reboots; and (b) upgrading the kernel.
Comment 24 rasmus 2015-03-08 14:06:24 UTC
> At this point, your best bet is to bisect the kernel and see where that leads.

The bug does not seem to be present in Linux 3.9 (the system ran Furmark for 18 hours).  In Linux 3.10 the system crashed within 10 minutes.  The rest of the Xorg-stack was the current one (from Arch repos).

Perhaps this is not a driver issue after all.  I guess I will try to open a bug report with Linux, though "between 3.9 and 3.10" is still terribly inaccurate...
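
The range found above (3.9 good, 3.10 bad) is the kind of input `git bisect` takes, and git keeps the bisect state inside the repository, so it survives reboots. What follows is a rough sketch of how the hard-reboot case could be handled, not a tested procedure: the marker file, the helper names and the 60 minute limit are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helpers for bisecting a bug whose symptom is a hard reboot.
# A marker file is written before each stress run; if it is still there after
# the next boot, the previous run ended in a reboot.
MARKER="${MARKER:-$HOME/.bisect-test-running}"

# Call right before starting the stress test.
begin_test() {
    uname -r > "$MARKER"
    sync    # flush to disk so that the marker survives a sudden reboot
}

# Call when the stress test survived the whole time limit.
end_test_ok() {
    rm -f "$MARKER"
    sync
}

# Call after booting: prints "crashed" if the last run never finished.
previous_run() {
    if [ -f "$MARKER" ]; then
        echo crashed
    else
        echo clean
    fi
}

# One-time setup inside a kernel git tree (tags as reported above):
#   git bisect start
#   git bisect bad v3.10
#   git bisect good v3.9
# Per step: build and boot the kernel that git checked out, then
#   begin_test
#   timeout 60m ./start_furmark_windowed_1024x640.sh   # 60m is an arbitrary limit
#   end_test_ok && git bisect good
# If the machine rebooted instead: boot a known-good kernel, confirm that
# previous_run prints "crashed", remove the marker, and run: git bisect bad
```

Since the reboots reported here usually occur within 10 minutes, a limit well above that keeps the risk of marking a bad kernel as good low, at the cost of a longer bisect.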
Comment 25 rasmus 2015-03-08 14:26:27 UTC
Any idea where in Linux the bug might be?  So that I can pass it on to the right maintainer?
Comment 26 rasmus 2015-03-08 17:36:29 UTC
Reported on the Linux bugzilla here:

https://bugzilla.kernel.org/show_bug.cgi?id=94551

