Created attachment 114045 [details]
output of dmesg captured as journalctl -b-1 after a crash

I have tried to provide the info requested by Intel and recommended by Fedora[1]. Kudos to the Fedora folks for providing very detailed instructions! If some information is missing, please let me know.

1 System environment
════════════════════

– chipset: HD 4000 with Intel i7 Ivy Bridge
– system architecture: x86_64
– xf86-video-intel: 2.99.91 (2.99.917)
– xserver: 1.16.3 (1.17.1)
– mesa: 10.4.3 (10.4.5)
– libdrm: 2.4.59 (2.4.59)
– kernel version: 3.18.7-200.fc21.x86_64 (3.18.6-1-ARCH)
– Linux distribution: Fedora 21 (Archlinux)
– Machine or mobo model: Thinkpad W530
– Display connector: laptop screen

I conducted the test below in Fedora. I normally use Arch. The problem is present in both distros.

2 Reproduce steps. Probability if not 100% reproducible
═══════════════════════════════════════════════════════

1. Download gputest from http://www.geeks3d.com/gputest/
2. Run start_furmark_windowed_1024x640.sh. Put it in full screen if you like.
3. On my system the computer crashes and reboots, typically within 10 minutes. The reboot looks as if power was cut and restored, as if the CPU were overheating (which it is not). IOW: the systemctl shutdown logs are not displayed.

3 Additional info
═════════════════

I experience crashes with reboots on my Thinkpad W530 with HD 4000 whenever the iGPU is exposed to moderate load, e.g. playing a simple video game (Shadowrun and Mark of the Ninja are two examples). This happens on my main distro, Arch, and on Fedora. It only happens when I use the Intel iGPU, and it happens irrespective of whether Nvidia Optimus is enabled.

Note: This is seemingly *not* a hardware issue! I can run mprime CPU load indefinitely on the system without a crash. The temperature never goes above 90 degrees when I run any test (and I had the fan replaced within the last three months). My brother, who also has a Thinkpad W530, experiences the same issue on Debian Sid.

Importantly: the computer is completely stable on Windows 7, where I have stress tested the system with the same procedure as in section 2 for 6-8 hours (latest Intel drivers, Nvidia Optimus disabled).

I have tested on Fedora 21 and Archlinux. The tests reported here were conducted on a clean Fedora 21 image because that is better for debugging.

4 Attachments
═════════════

• dmesg_crash.txt: output of dmesg captured as journalctl -b-1 after a crash.
• glxinfo output
• intel_reg_dumper output
• output of lspci -nn
• stdout.txt and stderr.txt: from running
  ┌────
  │ LIBGL_DEBUG=verbose start_furmark_windowed_1024x640.sh > stdout.txt 2> stderr.txt
  └────
• xorg_crash.txt: output of journalctl -b-1 -e _COMM=Xorg.bin after one crash.
• xorg_crash2.txt: output of journalctl -b-1 -e _COMM=Xorg.bin after another crash, but with more debug kernel modules.
• xorg_fine.txt: output of journalctl -b -e _COMM=Xorg.bin when a crash has not occurred (I didn't turn off the system).

5 "Missing" attachments
═══════════════════════

• Xorg.0.log: for some reason not present.
• xorg.conf: default setup.
• the last batch buffer: it shows nothing interesting before the reboot. Is there a way to record it persistently?
• I'm trying to get a "kdump" of the kernel when it crashes, as described here[2].

Footnotes
─────────

[1] https://01.org/linuxgraphics/documentation/how-report-bugs
    http://fedoraproject.org/wiki/How_to_debug_Xorg_problems
[2] http://fedoraproject.org/wiki/How_to_use_kdump_to_debug_kernel_crashes
Created attachment 114046 [details] glxinfo output
Created attachment 114047 [details] intel_reg_dumper output
Created attachment 114048 [details] output of lspci -nn
Created attachment 114049 [details] LIBGL_DEBUG=verbose start_furmark_windowed_1024x640.sh > stdout.txt 2> stderr.txt
Created attachment 114050 [details] output of journalctl -b-1 -e _COMM=Xorg.bin after a crash
Created attachment 114051 [details] xorg_crash2.txt output of journalctl -b-1 -e _COMM=Xorg.bin after another crash but with more debug kernel modules.
Created attachment 114052 [details] journalctl -b -e _COMM=Xorg.bin of a system that has not crashed.
If it is of any relevance, I have discussed the issue in this thread on the interwebs: http://forum.thinkpads.com/viewtopic.php?f=70&t=116472.
I tried to get a kdump following the Fedora wiki instructions, but nothing is saved to /var/dumps. Sorry.
System reboot is a processor event. A GPU failure just kills the system - I have yet to hear of one that could cause a spontaneous reboot. I would suggest you try setting up netconsole.
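For readers following along, a netconsole setup might look roughly like the sketch below. It is only an illustration, not the configuration used in this report: the IP addresses, MAC address, ports, interface name `eth0` and the helper function names are all placeholders, and the `nc` flags differ between netcat variants. netconsole usually needs a wired Ethernet interface, since many wireless drivers do not support it.

```shell
#!/bin/sh
# Sketch only: every address, port and interface name below is a placeholder.

# Build the value of the netconsole= module parameter, whose format is
#   <src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
netconsole_param() {
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# On the crashing laptop (run as root): raise the console log level so all
# kernel messages are forwarded, then load the module.
load_netconsole() {
    dmesg -n 8
    modprobe netconsole \
        "netconsole=$(netconsole_param 6665 192.168.1.10 eth0 \
                                       6666 192.168.1.20 00:11:22:33:44:55)"
}

# On a second machine that receives the log. Some netcat variants need
# "nc -u -l -p 6666" instead.
receive_log() {
    nc -u -l 6666 | tee netconsole.log
}
```

Start `receive_log` on the second machine first, then call `load_netconsole` on the laptop and run the stress test; whatever the kernel prints before the reset should end up in netconsole.log.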
Created attachment 114076 [details] netconsole output of crash
Hi Chris

> System reboot is a processor event. A GPU failure just kills the system - I have yet to hear of one that could cause a spontaneous reboot.

I don't understand most of what you are saying above. I'm merely a *user* of software and hardware. I can record a video of the screen if that helps.

> I would suggest you try setting up netconsole.

I have attached the requested output now.
Looks very, very suspicious. The reboot is not at the OS level, so down to firmware. Look at your BIOS settings and version.
Created attachment 114077 [details] output of dmidecode
Oops, I didn't delete the serial number (warranty) from the dmidecode file. Could you delete it?
Created attachment 114079 [details] currently active bios settings
I attached my current BIOS settings. I skipped a couple of sections of the BIOS, but hopefully the info you need is there.
Created attachment 114080 [details] dmidecode output
Chris,

> The reboot is not at the OS level, so down to firmware.

Just so I know how to proceed: how should I interpret the above statement? Should I try to get in touch with Lenovo engineers? (I don't know whether they have any open channels.) Again, there's no issue on W7, which is why I suspected the Linux drivers.
I do not think this is a firmware bug. Rather, I think it's a bug in Linux or Xorg+friends. I have run the test specified in "2 Reproduce steps" with Fedora 20 and Fedora 19 (ISO, no updates). In Fedora 20 the problem is present. In Fedora 19 I ran the test for approximately 3 hours without any reboots. Of course that does not mean that the bug isn't present, but in Fedora {20, 21} and Arch the reboot usually occurs within 10 minutes.
Your dmesg does not show a controlled shutdown. A GPU hang, even a low-level hardware hang, should not result in the machine rebooting. Your dmesg does show that the kernel disagrees with the ACPI firmware implementation and that it's actively managing the thermal throttling. At this point, your best bet is to bisect the kernel and see where that leads.
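One way to see what the thermal zones report right up to the reset is to log them to disk during the stress test. The sketch below assumes the standard sysfs thermal interface (/sys/class/thermal/thermal_zone*/temp, in millidegrees Celsius); the function names and the log path are made up for illustration.

```shell
#!/bin/sh
# Sketch only: function names and the log file location are hypothetical.

# Convert a sysfs reading in millidegrees Celsius (e.g. 87500) to "87.5".
millideg_to_c() {
    printf '%d.%d\n' "$(( $1 / 1000 ))" "$(( ($1 % 1000) / 100 ))"
}

# Append one line per thermal zone per second to the given log file.
log_thermal_zones() {   # usage: log_thermal_zones /path/to/logfile
    while :; do
        for zone in /sys/class/thermal/thermal_zone*; do
            [ -r "$zone/temp" ] || continue
            printf '%s %s %s\n' "$(date +%T)" "$(cat "$zone/type")" \
                "$(millideg_to_c "$(cat "$zone/temp")")" >> "$1"
        done
        sync      # flush to disk so the last lines survive a hard reset
        sleep 1
    done
}
```

Running `log_thermal_zones "$HOME/thermal.log" &` before starting Furmark leaves a record whose last lines show the temperatures shortly before the machine went down.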
So to understand: your claim is that there's no bug in the Intel drivers, but there's a bug in Linux? By now Fedora 19 (iso version) has been running for 8 hours. So almost certainly something was introduced after Fedora 19 that causes the reboots.
Also, how would I bisect the kernel in this case? The error involves a pretty big crash. I would appreciate hints on how to write a bisect program that would involve (a) potential reboots; and (b) upgrading the kernel.
> At this point, your best bet is to bisect the kernel and see where that leads.

The bug does not seem to be present in Linux 3.9 (the system ran Furmark for 18 hours). In Linux 3.10 the system crashed within 10 minutes. The rest of the Xorg stack was the current one (from the Arch repos).

Perhaps this is not a driver issue after all. I guess I will try to open a bug report with Linux, though "between 3.9 and 3.10" is still terribly inaccurate...
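A manual bisection between these two releases could be sketched as below. This is an illustration under assumptions, not a tested recipe for this machine: it presumes a git checkout of the mainline kernel with the v3.9 and v3.10 tags, a usable .config, and a distro where `make modules_install install` sets up the boot entry. The function names are made up. The bisect state lives in the repository's .git directory, so it persists across reboots, and each round is: build, install, reboot into the new kernel, run Furmark, record the verdict.

```shell
#!/bin/sh
# Sketch only: run inside a kernel git checkout; helper names are hypothetical.

# Run once: v3.10 crashed within 10 minutes, v3.9 survived 18 hours.
start_bisect() {
    git bisect start
    git bisect bad v3.10
    git bisect good v3.9
}

# Build and install whichever commit git bisect has checked out.
build_and_install() {
    make olddefconfig &&
    make -j"$(nproc)" &&
    sudo make modules_install install
}

# After rebooting and testing, record the result; git then checks out
# the next commit to try.
record_verdict() {   # usage: record_verdict good|bad
    case "$1" in
        good|bad) git bisect "$1" ;;
        *) echo "usage: record_verdict good|bad" >&2; return 2 ;;
    esac
}

# Rough number of rounds for N candidate commits: ceil(log2(N)).
# N can be obtained with: git rev-list --count v3.9..v3.10
bisect_rounds() {
    n=$1; rounds=0; span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2)); rounds=$((rounds + 1))
    done
    echo "$rounds"
}
```

Since a crash shows up within about 10 minutes but a good kernel can only be confirmed by a long run, the "good" verdicts are the expensive ones; `bisect_rounds` gives an estimate of how many build and test cycles to expect.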
Any idea where in Linux the bug might be? So that I can pass it on to the right maintainer?
Reported on the Linux bugzilla here: https://bugzilla.kernel.org/show_bug.cgi?id=94551