Created attachment 78455 [details] Demonstration of X server / OpenGL lockup with Haswell graphics

Some, but not all, combinations of X and OpenGL programs running on Haswell built-in graphics (using the "intel" video driver) can cause the X server to lock up solid: no response from the mouse and no ability to switch to a virtual terminal. The system used is an HP prototype with a Haswell CPU and Lynx Point PCH.

Attached is a shell script that runs programs from the xscreensavers package, which is installed by default. The script typically triggers the lockup within a few minutes. It creates a 1280x1024 window, so it won't run as is on lower resolutions. If you modify it, the principle is that the two smaller windows should lie within the larger one in the vertical dimension.

Driver stack is:
- xorg-server 1.6.5
- xf86-video-intel 2.21.4
- Mesa-9.0.3
- libdrm-2.4.41
- Kernel is from Daniel Vetter's git branch (git://people.freedesktop.org/~danvet/drm-intel)
  branch: drm-intel-testing
  Latest commit:
  commit 80ad9206c0d863832bc5f6008c4d1868d1df8e70
  Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
  Date: Fri Apr 19 14:36:51 2013 +0300
      drm/i915: Make struct dpll == intel_clock_t

GPU:
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x041a "Haswell GT2 S"
Let's start with dmesg, Xorg.0.log and lspci. When X locks, is the machine remotely accessible? Note that 2.21.4 is old and, if you are using SNA, contains a known bug with scanline waits which would be triggered in this scenario.
I forgot to mention: this sample script is running on top of a compiz session (compiz version 0.7.8). "UXA" is in use here. It takes longer to catch the lockup when compiz is disabled. "SNA" apparently isn't supported for Haswell: the X server fails to start (more or less silently).
Unfortunately the machine is no longer remotely accessible. Network adapter even breaks network for machines connected to the same switch. LOL. (but blacklisting the ethernet driver doesn't help; tried that already, it's not the culprit). I'm going to attach the requested files.
SNA is definitely supported on Haswell - the instability there is more than likely the same root cause. Using UXA on Haswell only means that quite a few features are missing, which helps to rule out the ddx as the source of your troubles.
Created attachment 78460 [details] Xorg.0.log
Created attachment 78461 [details] lspci.txt
Created attachment 78469 [details] dmesg.log

Initial dmesg output. After starting the sample script I see this via dmesg until the machine freezes:

[ 114.824506] [drm:i915_driver_open],
[ 114.850319] [drm:i915_gem_context_create_ioctl], HW context 1 created
[ 119.747050] [drm:i915_driver_open],
[ 119.820812] [drm:i915_gem_context_create_ioctl], HW context 1 created
[ 129.742931] [drm:i915_driver_open],
[ 129.763661] [drm:i915_gem_context_create_ioctl], HW context 1 created
[ 149.748183] [drm:i915_driver_open],
[ 149.769426] [drm:i915_gem_context_create_ioctl], HW context 1 created
[ 169.766972] [drm:i915_driver_open],
[ 169.783286] [drm:i915_gem_context_create_ioctl], HW context 1 created
This has been with drm.debug=0xe set.
I've experienced two hard hangs now running that script. Will try to narrow down the cause over the next few days.
Still hangs with i915.i915_enable_rc6=0.
Disabling hw contexts is ineffective.
i915.reset=0 is no defence.
i915.semaphores=0 still hangs.
i915.i915_enable_ppgtt=0 still hangs. I think I am at the end of the kernel tunables. :|
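When repeating these experiments it helps to confirm which tunables actually reached the kernel before drawing conclusions from a run. The following is only a minimal sketch: it checks a kernel command line for the parameter names quoted in this thread (the hw-contexts switch is left out because its name is not given here), and the helper name `report_tunables` is hypothetical.

```shell
#!/bin/bash
# Sketch: report which of the tunables mentioned in this thread are set
# on a kernel command line. The function name is illustrative only.

# Parameter names as quoted in the comments above.
tunables="i915.i915_enable_rc6 i915.reset i915.semaphores i915.i915_enable_ppgtt drm.debug"

# report_tunables CMDLINE
# Prints "name=value" for each tunable found, "name (unset)" otherwise.
report_tunables() {
    local cmdline=$1 name word found
    for name in $tunables; do
        found=""
        for word in $cmdline; do
            case $word in
                "$name"=*) found=$word ;;
            esac
        done
        if [ -n "$found" ]; then
            echo "$found"
        else
            echo "$name (unset)"
        fi
    done
}

# Typical use on a live system:
#   report_tunables "$(cat /proc/cmdline)"
```

Pasting the output next to each test result makes it clear which combination a given hang was observed with.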
Disabling acceleration in the ddx (even the generic BLT code) prevents the hang - though I strongly believe that to be a timing artifact.
My issue appears to stem from the DPMS kicking in - with xset s 0 -dpms it is still running...
And then to confirm: immediate hang after several hours of runtime by executing xset dpms force off.
So it appears not to be the dpms off that is the trigger, but the frequency of the GL rendering: vblank_mode=0 ./bug.sh reproduces the bug quickly.
Also no squawk over netconsole. :|
Adding more people to this fun here ...
Drumroll please...

=0 haswell:/opt/xorg/src/mesa/mesa (master)$ git diff
diff --git a/src/mesa/drivers/dri/intel/intel_context.c b/src/mesa/drivers/dri/i
index 0a1dd75..65f8738 100644
--- a/src/mesa/drivers/dri/intel/intel_context.c
+++ b/src/mesa/drivers/dri/intel/intel_context.c
@@ -704,7 +704,7 @@ intelInitContext(struct intel_context *intel,
    intel->has_separate_stencil = intel->intelScreen->hw_has_separate_stencil;
    intel->must_use_separate_stencil = intel->intelScreen->hw_must_use_separate_
-   intel->has_hiz = intel->gen >= 6;
+   intel->has_hiz = intel->gen >= 6 && 0;
    intel->has_llc = intel->intelScreen->hw_has_llc;
    intel->has_swizzling = intel->intelScreen->hw_has_swizzling;
Wow! Thanks a lot, Chris! Good that you tested also with "outdated" Mesa sources (and thus could reproduce), since apparently this issue has already been addressed in current git. ;-)

commit 1ba8c6ad03a3f03ecc6b66e1c0e10a4d6010122f
Author: Kenneth Graunke <kenneth@whitecape.org>
Date: Wed Mar 7 10:16:00 2012 -0800

    i965: Disable HiZ on Haswell for now.

    Getting HiZ working means updating all the state packets for resolves and clears. It's not worth doing until we get the basics working.

    Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
    Reviewed-by: Eric Anholt <eric@anholt.net>

diff --git a/src/mesa/drivers/dri/intel/intel_context.c b/src/mesa/drivers/dri/intel/intel_context.c
index fd5f0b6..1aa2e9a 100644
--- a/src/mesa/drivers/dri/intel/intel_context.c
+++ b/src/mesa/drivers/dri/intel/intel_context.c
@@ -629,7 +629,7 @@ intelInitContext(struct intel_context *intel,
    intel->has_separate_stencil = intel->intelScreen->hw_has_separate_stencil;
    intel->must_use_separate_stencil = intel->intelScreen->hw_must_use_separate_stencil;
-   intel->has_hiz = intel->gen >= 6;
+   intel->has_hiz = intel->gen >= 6 && !intel->is_haswell;
    intel->has_llc = intel->intelScreen->hw_has_llc;
    intel->has_swizzling = intel->intelScreen->hw_has_swizzling;
However it has been enabled again:

commit e4484a0309ab44a790df29a599fb2b01eb885d5a
Author: Chad Versace <chad.versace@linux.intel.com>
Date: Fri Apr 5 16:35:47 2013 -0700

    intel/hsw: Enable hiz (v2)

    Enable hiz by setting intel_context::has_hiz. However, to work around a hardware bug, we selectively enable hiz for only nicely aligned miptree slices.
Indeed I was proven wrong. The issue is, we're still testing with intel->has_hiz = intel->gen >= 6 && !intel->is_haswell; (Mesa 9.0.3), so HiZ was already disabled for us. :-(
Chris, are you testing with "UXA" or "SNA"? We're still using "UXA" as default.
Hi, I submitted the initial issue to SUSE. I would like to ask whether this problem has been observed on hardware implementations other than the HP prototype that was initially mentioned. From the bugzilla it looked like maybe remote access to the HP proto was used, at least initially. I have observed that some time after the hang, on our prototype system, the LED indicating Catastrophic Error (CATERR#) comes on, and stays on until the reboot. I'm wondering if that has also been observed on other implementations (if there is a way to tell). This CATERR may be a normal consequence of getting stuck, just wanted to add the note.
John, I could reproduce that issue also on a Haswell ULT laptop.
I think I'm seeing this bug as well (Gentoo, Linux 3.9.5, Mesa 9.1.3, intel driver 2.21.9 with SNA, xorg-desktop Haswell i7-4770k).
(In reply to comment #25)
> Chris, are you testing with "UXA" or "SNA"? We're still using "UXA" as
> default.

Chris, unfortunately I didn't hear back from you. :-(
Ok, I've managed to reproduce this using a Haswell ULT system, with a stock install of Arch, using UXA under a gnome desktop. The components are: Mesa: 9.1.3-2 Kernel: 3.9.7-1 xf86-video-intel: 2.21.10-2 xserver: 1.14.2-1 I ran the test with vblank_mode=0, and it failed after running for 16 min. Mesa 9.1.3 does not use HiZ on Haswell, so this seems to rule out HiZ as the culprit. Also, my use of gnome seems to rule out compiz as the source of the problem. I can confirm Stefan's observation that this is a hard lockup--there is no response from the mouse and no ability to switch to a virtual terminal. I'll continue investigating and post updates as I discover more.
Ok, here's what I've found:

- I've reproduced the bug 7 times, with the amount of time to failure varying wildly (I've seen 3m*, 8m*, 11m*, 16m, 2h48m, and 4h03m, and one failure where I failed to record the amount of time).
- Note that the times shown above with "*" occurred today, after I upgraded the BIOS on my HSW ULT from version 113 to 126 (and upgraded the KSC EC version from 1.20 to 1.24). The others occurred earlier in the week. This makes me suspicious that the problem may be BIOS-related, since the failures seem to be more frequent since the BIOS upgrade.
- Each time I've reproduced the bug I was running with vblank_mode=0.
- I've reproduced the bug both with the stock Arch kernel (3.9.7-1) and with drm-intel-nightly (00b224eee).
- I've reproduced the bug with "iommu=off" on the kernel command line.
- Each time the bug occurs, the computer locks up completely: you can't switch VTs, pressing NumLock fails to toggle the NumLock light, and the machine is unresponsive via ssh. This is not a simple GPU hang.
- Contrary to comment 16, DPMS does not seem to be involved: I tried running "while true; do sleep 10; xset dpms force off; sleep 10; xset dpms force on; done" in parallel with the script, and it did not noticeably increase the rate of failure.
- I also investigated whether this might be a thermal issue. One of my failures occurred with a very hot CPU (due to my not setting up fan settings correctly after a BIOS upgrade), but another occurred at a temperature of 61C, which is well within normal operating range. So I believe it is not a thermal issue.

At this point I believe this is most likely a kernel, BIOS, or hardware bug, and it needs investigation by someone with kernel expertise. I haven't seen any evidence that it's related to Mesa (which is where my expertise lies). So I'm reassigning to intel-gfx-bugs@lists.freedesktop.org.
For me, disabling hiz still makes the difference between the script dying after 1 cycle (after the first kill, i.e. less than a minute) and running for an indeterminate period.
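Because the machine locks up hard, the time to failure is easy to lose (as happened for one of the runs above). One way to keep it is to write the elapsed run time to disk at short intervals, so that the last value survives the reboot. This is a minimal sketch under that assumption; `fmt_elapsed` and `heartbeat` are hypothetical helper names, and the log path in the usage comment is arbitrary.

```shell
#!/bin/bash
# Sketch: keep a record of elapsed run time that survives a hard lockup.

# fmt_elapsed SECONDS
# Formats a duration in the style used in this thread, e.g. "16m" or "2h48m".
fmt_elapsed() {
    local s=$1 h m
    h=$(( s / 3600 ))
    m=$(( (s % 3600) / 60 ))
    if [ "$h" -gt 0 ]; then
        printf '%dh%02dm\n' "$h" "$m"
    else
        printf '%dm\n' "$m"
    fi
}

# heartbeat LOGFILE [INTERVAL]
# Overwrites LOGFILE with the elapsed run time every INTERVAL seconds
# (default 10) and calls sync so the value reaches the disk.
heartbeat() {
    local log=$1 interval=${2:-10} start now
    start=$(date +%s)
    while true; do
        now=$(date +%s)
        fmt_elapsed $(( now - start )) > "$log"
        sync
        sleep "$interval"
    done
}

# Typical use, started just before the reproduction script:
#   heartbeat "$HOME/time-to-failure.log" 10 &
#   vblank_mode=0 ./bug.sh
```

After the forced reboot, the log file holds the run time recorded at the last heartbeat before the hang, accurate to within the chosen interval.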
Chris, I'm afraid I need to ask again. :-( Your test results are with 'sna' or 'UXA'?
As before, unsurprisingly, the failures I see are with both UXA and SNA.
It wasn't obvious to me, since you didn't mention it before and I'm aware that SNA development is one of your main tasks at Intel. I'm sorry. But this means that Paul Berry/me and you are either using a different software stack or different hardware. Oh well ...
I hit something running this over the weekend. Somehow I ended up with the giant octopus screensaver running when it hung. Nothing in netconsole; the machine seems totally dead. The last log message I had was at 50m in, about MCE events (I always get a bunch of thermal MCE events). However, I know the test was running for longer than that before I left. If someone out there can reproduce this fairly quickly, on a hunch, can you try lowering the max GPU freq to the rp1 value? Stefan?
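For anyone trying the frequency experiment, the following is a minimal sketch of capping the maximum GPU frequency to the RP1 value. It assumes the running kernel exposes `gt_RP1_freq_mhz` and `gt_max_freq_mhz` under `/sys/class/drm/card0`; older kernels may not provide these files, in which case the function simply reports that and returns an error. The function name `cap_gpu_to_rp1` is hypothetical, and writing the limit requires root.

```shell
#!/bin/bash
# Sketch: cap the i915 maximum GPU frequency to the RP1 value.
# ASSUMPTION: the kernel provides gt_RP1_freq_mhz and gt_max_freq_mhz
# in the given sysfs directory (default /sys/class/drm/card0).

# cap_gpu_to_rp1 [SYSFS_DIR]
cap_gpu_to_rp1() {
    local dir=${1:-/sys/class/drm/card0} rp1
    if [ ! -r "$dir/gt_RP1_freq_mhz" ] || [ ! -w "$dir/gt_max_freq_mhz" ]; then
        echo "frequency files not available under $dir" >&2
        return 1
    fi
    # Read the RP1 frequency and use it as the new upper limit.
    rp1=$(cat "$dir/gt_RP1_freq_mhz")
    echo "$rp1" > "$dir/gt_max_freq_mhz"
    echo "max GPU frequency capped to ${rp1} MHz"
}

# Typical use (as root), before starting the reproduction script:
#   cap_gpu_to_rp1
```

The directory argument exists only so the function can be pointed at a different card or exercised against a scratch directory.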
On my board, this appears to be an unhandled MCE exception which I've determined via the CATERR signal on the motherboard. Chris can you confirm if this is what you see too?
Hmm. I just hit the hang, much quicker, and without CATERR.
(In reply to comment #38)
> Hmm. I just hit the hang, much quicker, and without CATERR.

I guess I hit send too soon. What actually happens is CATERR lights up about a minute after the hang. I'm thinking this is because the MCE handler cannot run after the system is hard hung.
So I've hooked up an ITP now, and not sure if it's related, but 3 times in a row my machine has powered down instead of hanging. To me this really smells like a thermal issue.
It does also fail eventually with hiz disabled. Even with max_freq=RP1, it fails very very quickly with hiz enabled, and eventually without.
I too was able to hit it at min frequency. I have punted this one to internal HW/validation teams. If anyone has any surefire ways to either reduce the complexity of the test, or reduce the time to failure, please let me know. Chris' HiZ trick doesn't work on my machine.
Running intel-gpu-top + kde, I can now hit it quite fast:

#!/bin/bash
# demonstration of X server / OpenGL lockup on SLED SP3 Beta 2
# with Haswell graphics. X server needs to be using "intel" driver.

export LIBGL_DRIVERS_PATH=/home/test/mesa/lib/

prog1=hypertorus
prog2=antmaze
prog3=phosphor

konsole --noclose -e sudo ./intel-gpu-tools/tools/intel_gpu_top

cd /usr/lib64/xscreensaver
vblank_mode=0 ./$prog1 -geometry 1280x1024+10+10 -delay 0 &
sleep 5
vblank_mode=0 ./$prog2 -geometry 1024x768+25+25 -delay 0 &
pid1=$!

mode=0
while true
do
    if [ $mode -eq 0 ]
    then
        vblank_mode=0 ./$prog3 -geometry 1024x768+425+25 -delay 0 &
        pid2=$!
    else
        vblank_mode=0 ./$prog2 -geometry 1024x768+25+25 -delay 0 &
        pid2=$!
    fi
    mode=$(( ( $mode + 1 ) % 2 ))
    # echo $mode
    sleep 10
    kill $pid1
    pid1=$pid2
done
Created attachment 82350 [details] [review] serialise most register access A hack for testing
More complete series that serialises all register access: https://patchwork.kernel.org/patch/2827033/ https://patchwork.kernel.org/patch/2827034/ https://patchwork.kernel.org/patch/2827032/ https://patchwork.kernel.org/patch/2827031/
The serialization patch in comment 44 seems to help. The tests have gone well so far. However, with backporting to stable kernels in mind (yes, we need it!), the patch series in comment 45 isn't appropriate: the code juggling makes backporting quite hard. Could this instead be a simple fix first, easily applicable to older kernels, followed by the cleanup / code-shuffle patch series, please?
The patch https://bugs.freedesktop.org/attachment.cgi?id=82350 is what I would suggest applying for a backport; it will fix almost all likely issues. But you really do need to juggle the code around in order to serialise all register access (even if you do not move it into a separate file).
I tried the patch series in comment 45 with the latest drm-intel-fixes (7dcd2677e) on our Haswell laptop; no hang happened after running for about 1 hour. I also tested the same kernel without these patches, and it hangs within a minute.
Step 1:

commit a7cd1b8fea2f341b626b255d9898a5ca5fabbf0a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Jul 19 20:36:51 2013 +0100

    drm/i915: Serialize almost all register access
Stefan, Fangxun, how is it going?
The duct-tape is now in -fixes, and just recently I've merged the full solution from Chris to dinq. So I think we can close this. Thanks for reporting this issue and please reopen if it blows up again.
It works fine with latest drm-next-fixes kernel(61c254) on haswell laptop.
I am on drm-intel-next-queued (fae5cbf to be precise) and can still hang my machine very reliably running Ben's script in Comment 43. Am I just on the wrong branch or is this not entirely fixed?

For the record, I have a desktop GT2 (on the i7-4770):
$ lspci -nn | grep VGA
00:02.0 VGA compatible controller [0300]: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller [8086:0412] (rev 06)
(In reply to comment #53)
> I am on drm-intel-next-queued (fae5cbf to be precise) and can still hang my
> machine very reliably running the Ben's script in Comment 43. Am I just on
> the wrong branch or is this not entirely fixed?
>
> For the record, I have a desktop GT2 (on the i7-4770):
> $ lspci -nn | grep VGA
> 00:02.0 VGA compatible controller [0300]: Intel Corporation Xeon E3-1200
> v3/4th Gen Core Processor Integrated Graphics Controller [8086:0412] (rev 06)

Running intel-gpu-top concurrently with the i915.ko kernel driver is known to kill the system. If you can still crash your system without that, then that's something entirely different.
(In reply to comment #53)
> I am on drm-intel-next-queued (fae5cbf to be precise) and can still hang my
> machine very reliably running the Ben's script in Comment 43. Am I just on
> the wrong branch or is this not entirely fixed?
>
> For the record, I have a desktop GT2 (on the i7-4770):
> $ lspci -nn | grep VGA
> 00:02.0 VGA compatible controller [0300]: Intel Corporation Xeon E3-1200
> v3/4th Gen Core Processor Integrated Graphics Controller [8086:0412] (rev 06)

The script I posted is now known to exercise a HW bug. http://lists.freedesktop.org/archives/intel-gfx/2013-July/030188.html

I am in favor of disabling GPU top for HSW without a special parameter, FWIW.
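Since reports in this thread come from several Haswell variants, it can be convenient to pull just the vendor:device pair out of the `lspci -nn` output when comparing machines. This is a small sketch using `sed`; the function name `gpu_pci_id` is hypothetical, and it assumes the ID appears in the usual `[vvvv:dddd]` form shown above.

```shell
#!/bin/bash
# Sketch: extract the "vendor:device" PCI ID from one line of `lspci -nn`.

# gpu_pci_id LSPCI_LINE
# Prints e.g. "8086:0412"; prints nothing if no [xxxx:xxxx] field is found.
gpu_pci_id() {
    printf '%s\n' "$1" |
        sed -n 's/.*\[\([0-9a-fA-F]\{4\}:[0-9a-fA-F]\{4\}\)\].*/\1/p'
}

# Typical use on a live system:
#   gpu_pci_id "$(lspci -nn | grep VGA)"
```

The class code in the first bracket pair (`[0300]`) has no colon, so it does not match the pattern and only the vendor:device pair is returned.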
I'm not a fan of disabling it. intel_gpu_top has been rock solid on my Haswell, at least...much better than on Ivybridge...
I see. Without intel_gpu_top, I have not seen any hangs so far, neither with the test case nor with just using the machine (whereas it would not survive two days of uptime without the fixes). Thanks for the clarification.