Bug 63914 - [hsw] Cycling between GL/X rendering causes a hard hang
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: high critical
Assignee: Ben Widawsky
QA Contact: Intel GFX Bugs mailing list
Reported: 2013-04-25 10:27 UTC by Stefan Dirsch
Modified: 2017-07-24 22:58 UTC

Attachments
Demonstration of X server / OpenGL lockup with Haswell graphics (589 bytes, text/plain)
2013-04-25 10:27 UTC, Stefan Dirsch
Xorg.0.log (18.07 KB, text/plain)
2013-04-25 11:05 UTC, Stefan Dirsch
lspci.txt (5.75 KB, text/plain)
2013-04-25 11:07 UTC, Stefan Dirsch
dmesg.log (113.64 KB, text/plain)
2013-04-25 12:54 UTC, Stefan Dirsch
serialise most register access (1.79 KB, patch)
2013-07-12 00:08 UTC, Chris Wilson

Description Stefan Dirsch 2013-04-25 10:27:12 UTC
Created attachment 78455 [details]
Demonstration of X server / OpenGL lockup with Haswell graphics

Some, but not all, combinations of various X and OpenGL programs running on
Haswell built-in graphics (using the "intel" video driver) can cause the X
server to lock up solid, no response from the mouse and no ability to switch to
a virtual terminal.

The system used is an HP prototype with Haswell CPU and Lynx Point PCH. 

Attached is a shell script that runs programs from the xscreensaver package,
which is installed by default.  This script typically triggers the lockup
within a few minutes.  It creates a 1280x1024 window, so it won't run as is
at lower resolutions.  If you modify it, the principle is that the two
smaller windows should lie within the larger one in the vertical dimension.

Driver stack is:

- xorg-server 1.6.5
- xf86-video-intel 2.21.4
- Mesa-9.0.3
- libdrm-2.4.41
- Kernel is from Daniel Vetter's git branch
  (git://people.freedesktop.org/~danvet/drm-intel)
  branch: drm-intel-testing
  Latest commit:
  commit 80ad9206c0d863832bc5f6008c4d1868d1df8e70
  Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
  Date:   Fri Apr 19 14:36:51 2013 +0300

    drm/i915: Make struct dpll == intel_clock_t

GPU:
  Vendor: pci 0x8086 "Intel Corporation"
  Device: pci 0x041a "Haswell GT2 S"
Comment 1 Chris Wilson 2013-04-25 10:31:14 UTC
Let's start with dmesg, Xorg.0.log and lspci. When X locks, is the machine remotely accessible?

Note that 2.21.4 is old and, if you are using SNA, contains a known bug with scanline waits that would be triggered in this scenario.
Comment 2 Stefan Dirsch 2013-04-25 10:47:37 UTC
I forgot to mention: this sample script is running on top of a compiz session (compiz version 0.7.8). "UXA" is in use here. It takes longer to hit the lockup with compiz disabled. "SNA" apparently isn't supported for Haswell; the Xserver fails to start (more or less silently).
Comment 3 Stefan Dirsch 2013-04-25 10:54:34 UTC
Unfortunately the machine is no longer remotely accessible. Network adapter even breaks network for machines connected to the same switch. LOL. (but blacklisting the ethernet driver doesn't help; tried that already, it's not the culprit). I'm going to attach the requested files.
Comment 4 Chris Wilson 2013-04-25 10:57:19 UTC
SNA is definitely supported on Haswell - the instability there is more than likely the same root cause.

Using UXA on Haswell only means that you are missing quite a few features, which helps to rule out the ddx as the source of your troubles.
Comment 5 Stefan Dirsch 2013-04-25 11:05:46 UTC
Created attachment 78460 [details]
Xorg.0.log
Comment 6 Stefan Dirsch 2013-04-25 11:07:32 UTC
Created attachment 78461 [details]
lspci.txt
Comment 7 Stefan Dirsch 2013-04-25 12:54:51 UTC
Created attachment 78469 [details]
dmesg.log

Initial dmesg output. After starting the sample script I see this via dmesg until the machine freezes:

[  114.824506] [drm:i915_driver_open], 
[  114.850319] [drm:i915_gem_context_create_ioctl], HW context 1 created
[  119.747050] [drm:i915_driver_open], 
[  119.820812] [drm:i915_gem_context_create_ioctl], HW context 1 created
[  129.742931] [drm:i915_driver_open], 
[  129.763661] [drm:i915_gem_context_create_ioctl], HW context 1 created
[  149.748183] [drm:i915_driver_open], 
[  149.769426] [drm:i915_gem_context_create_ioctl], HW context 1 created
[  169.766972] [drm:i915_driver_open], 
[  169.783286] [drm:i915_gem_context_create_ioctl], HW context 1 created
Comment 8 Stefan Dirsch 2013-04-25 12:59:49 UTC
This has been with drm.debug=0xe set.
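
For anyone decoding that value, here is a minimal sketch. It assumes the drm.debug bit layout found in the kernel's drmP.h of that period (0x1 core, 0x2 driver, 0x4 KMS, 0x8 PRIME); the helper name decode_drm_debug is made up for illustration and is not part of any tool.

```shell
#!/bin/bash
# Decode a drm.debug bitmask into category names.
# ASSUMPTION: bit 0 = core, bit 1 = driver, bit 2 = kms, bit 3 = prime.
decode_drm_debug() {
	mask=$(( $1 ))
	out=""
	bit=1
	for name in core driver kms prime
	do
		if [ $(( mask & bit )) -ne 0 ]
		then
			# append with a separating space once out is non-empty
			out="${out:+$out }$name"
		fi
		bit=$(( bit * 2 ))
	done
	echo "$out"
}

decode_drm_debug 0xe    # prints: driver kms prime
```

Under that assumption, 0xe enables driver, KMS and PRIME messages and leaves the core messages off.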
Comment 9 Chris Wilson 2013-04-25 21:36:19 UTC
I've experienced two hard hangs now running that script. Will try to narrow down the cause over the next few days.
Comment 10 Chris Wilson 2013-04-26 10:30:21 UTC
Still hangs with i915.i915_enable_rc6=0.
Comment 11 Chris Wilson 2013-04-26 10:51:39 UTC
Disabling hw contexts is ineffective.
Comment 12 Chris Wilson 2013-04-26 13:14:12 UTC
i915.reset=0 is no defence.
Comment 13 Chris Wilson 2013-04-26 13:36:41 UTC
i915.semaphores=0 still hangs.
Comment 14 Chris Wilson 2013-04-26 14:45:44 UTC
i915.i915_enable_ppgtt=0 still hangs.

I think I am at the end of the kernel tunables. :|
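
To keep track of which of the tunables from comments 10 to 14 are actually set on a given boot, something like the following may help. It is only a sketch: report_tunables and TUNABLES are hypothetical names, and the parameter names are the ones quoted above, which may not exist on other kernel versions. Comment 11 does not name a parameter for hardware contexts, so none is listed for it.

```shell
#!/bin/bash
# Parameters quoted in the comments above (each was tried with =0).
TUNABLES="i915.i915_enable_rc6 i915.reset i915.semaphores i915.i915_enable_ppgtt"

# Print "name=value" for each tunable found in the given kernel
# command line string, or "name=unset" if it does not appear.
report_tunables() {
	cmdline=$1
	for t in $TUNABLES
	do
		val=unset
		for word in $cmdline
		do
			case $word in
				"$t"=*) val=${word#*=} ;;
			esac
		done
		echo "$t=$val"
	done
}

# Example with a literal command line (hypothetical values):
report_tunables "root=/dev/sda1 i915.i915_enable_rc6=0 i915.semaphores=0"
```

On a running system the argument would be "$(cat /proc/cmdline)".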
Comment 15 Chris Wilson 2013-04-26 21:07:46 UTC
Disabling acceleration in the ddx (even the generic BLT code) prevents the hang - though I strongly believe that to be a timing artifact.
Comment 16 Chris Wilson 2013-04-27 11:30:11 UTC
My issue appears to stem from the DPMS kicking in - with xset s 0 -dpms it is still running...
Comment 17 Chris Wilson 2013-04-27 15:29:05 UTC
And then to confirm: immediate hang after several hours of runtime by executing xset dpms force off.
Comment 18 Chris Wilson 2013-04-28 08:48:47 UTC
So it appears not to be the dpms off that is the trigger, but the frequency of the GL rendering: vblank_mode=0 ./bug.sh reproduces the bug quickly.
Comment 19 Chris Wilson 2013-04-30 09:20:55 UTC
Also no squawk over netconsole. :|
Comment 20 Daniel Vetter 2013-04-30 09:26:31 UTC
Adding more people to this fun here ...
Comment 21 Chris Wilson 2013-04-30 12:33:13 UTC
Drumroll please...

=0 haswell:/opt/xorg/src/mesa/mesa (master)$ git diff
diff --git a/src/mesa/drivers/dri/intel/intel_context.c b/src/mesa/drivers/dri/i
index 0a1dd75..65f8738 100644
--- a/src/mesa/drivers/dri/intel/intel_context.c
+++ b/src/mesa/drivers/dri/intel/intel_context.c
@@ -704,7 +704,7 @@ intelInitContext(struct intel_context *intel,
 
    intel->has_separate_stencil = intel->intelScreen->hw_has_separate_stencil;
    intel->must_use_separate_stencil = intel->intelScreen->hw_must_use_separate_
-   intel->has_hiz = intel->gen >= 6;
+   intel->has_hiz = intel->gen >= 6 && 0;
    intel->has_llc = intel->intelScreen->hw_has_llc;
    intel->has_swizzling = intel->intelScreen->hw_has_swizzling;
Comment 22 Stefan Dirsch 2013-04-30 12:52:10 UTC
Wow! Thanks a lot, Chris! Good that you tested also with "outdated" Mesa sources (and thus could reproduce), since apparently this issue has already been addressed in current git. ;-)

commit 1ba8c6ad03a3f03ecc6b66e1c0e10a4d6010122f
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Wed Mar 7 10:16:00 2012 -0800

    i965: Disable HiZ on Haswell for now.
    
    Getting HiZ working means updating all the state packets for resolves
    and clears.  It's not worth doing until we get the basics working.
    
    Signed-off-by: Kenneth Graunke <kenneth@whitecape.org>
    Reviewed-by: Eric Anholt <eric@anholt.net>

diff --git a/src/mesa/drivers/dri/intel/intel_context.c b/src/mesa/drivers/dri/intel/intel_context.c
index fd5f0b6..1aa2e9a 100644
--- a/src/mesa/drivers/dri/intel/intel_context.c
+++ b/src/mesa/drivers/dri/intel/intel_context.c
@@ -629,7 +629,7 @@ intelInitContext(struct intel_context *intel,
 
    intel->has_separate_stencil = intel->intelScreen->hw_has_separate_stencil;
    intel->must_use_separate_stencil = intel->intelScreen->hw_must_use_separate_stencil;
-   intel->has_hiz = intel->gen >= 6;
+   intel->has_hiz = intel->gen >= 6 && !intel->is_haswell;
    intel->has_llc = intel->intelScreen->hw_has_llc;
    intel->has_swizzling = intel->intelScreen->hw_has_swizzling;
Comment 23 Chris Wilson 2013-04-30 13:32:02 UTC
However it has been enabled again:

commit e4484a0309ab44a790df29a599fb2b01eb885d5a
Author: Chad Versace <chad.versace@linux.intel.com>
Date:   Fri Apr 5 16:35:47 2013 -0700

    intel/hsw: Enable hiz (v2)
    
    Enable hiz by setting intel_context::has_hiz.  However, to work around
    a hardware bug, we selectively enable hiz for only nicely aligned miptree
    slices.
Comment 24 Stefan Dirsch 2013-04-30 13:44:27 UTC
Indeed I was proven wrong. The issue is, we're still testing with

   intel->has_hiz = intel->gen >= 6 && !intel->is_haswell;

(Mesa 9.0.3), so HiZ was already disabled for us. :-(
Comment 25 Stefan Dirsch 2013-05-03 13:20:28 UTC
Chris, are you testing with "UXA" or "SNA"? We're still using "UXA" as default.
Comment 26 John Sundragon Waitz 2013-05-21 22:49:21 UTC
Hi, I submitted the initial issue to SUSE.  I would like to ask whether this problem has been observed on hardware implementations other than the HP prototype that was initially mentioned.  From the bugzilla it looked like maybe remote access to the HP proto was used, at least initially.

I have observed that some time after the hang, on our prototype system, the LED indicating Catastrophic Error (CATERR#) comes on, and stays on until the reboot.  I'm wondering if that has also been observed on other implementations (if there is a way to tell).

This CATERR may be a normal consequence of getting stuck, just wanted to add the note.
Comment 27 Stefan Dirsch 2013-05-22 06:11:20 UTC
John, I could reproduce that issue also on a Haswell ULT laptop.
Comment 28 Tolga Dalman 2013-06-15 13:50:47 UTC
I think I'm seeing this bug as well (Gentoo, Linux 3.9.5, Mesa 9.1.3, intel driver 2.21.9 with SNA, xorg-desktop Haswell i7-4770k).
Comment 29 Stefan Dirsch 2013-06-19 14:54:15 UTC
(In reply to comment #25)
> Chris, are you testing with "UXA" or "SNA"? We're still using "UXA" as
> default.

Chris, unfortunately I didn't hear back from you. :-(
Comment 30 Paul Berry 2013-06-26 00:04:45 UTC
Ok, I've managed to reproduce this using a Haswell ULT system, with a stock install of Arch, using UXA under a gnome desktop.  The components are:

Mesa: 9.1.3-2
Kernel: 3.9.7-1
xf86-video-intel: 2.21.10-2
xserver: 1.14.2-1

I ran the test with vblank_mode=0, and it failed after running for 16 min.

Mesa 9.1.3 does not use HiZ on Haswell, so this seems to rule out HiZ as the culprit.

Also, my use of gnome seems to rule out compiz as the source of the problem.

I can confirm Stefan's observation that this is a hard lockup--there is no response from the mouse and no ability to switch to a virtual terminal.

I'll continue investigating and post updates as I discover more.
Comment 31 Paul Berry 2013-06-27 21:32:48 UTC
Ok, here's what I've found:

- I've reproduced the bug 7 times, with the amount of time to failure varying wildly (I've seen 3m*, 8m*, 11m*, 16m, 2h48m, and 4h03m, and one failure where I failed to record the amount of time).

- Note that the times shown above with "*" occurred today, after I upgraded the BIOS on my HSW ULT from version 113 to 126 (and upgraded the KSC EC version from 1.20 to 1.24).  The others occurred earlier in the week.  This makes me suspicious that the problem may be BIOS-related, since the failures seem to be more frequent since the BIOS upgrade.

- Each time I've reproduced the bug I was running with vblank_mode=0.

- I've reproduced the bug both with the stock Arch kernel (3.9.7-1) and with drm-intel-nightly (00b224eee).

- I've reproduced the bug with "iommu=off" on the kernel command line.

- Each time the bug occurs, the computer locks up completely: you can't switch VT's, pressing NumLock fails to toggle the NumLock light, and the machine is unresponsive via ssh.  This is not a simple GPU hang.

- Contrary to comment 16, DPMS does not seem to be involved: I tried running "while true; do sleep 10; xset dpms force off; sleep 10; xset dpms force on; done" in parallel with the script, and it did not noticeably increase the rate of failure.

- I also investigated whether this might be a thermal issue.  One of my failures occurred with a very hot CPU (due to my not setting up fan settings correctly after a BIOS upgrade), but another occurred at a temperature of 61C, which is well within normal operating range.  So I believe it is not a thermal issue.
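
The DPMS cross-check from the list above, written out as a small helper. This is a sketch only: dpms_cycle and the XSET override are hypothetical additions so that the loop is bounded and can be dry-run; the xset dpms force off / force on commands and the 10 second interval come from the one-liner quoted above.

```shell
#!/bin/bash
# usage: dpms_cycle <cycles> <seconds>
# Toggle DPMS off and on <cycles> times, sleeping <seconds> before each switch.
# XSET may be set to another command (e.g. echo) for a dry run.
dpms_cycle() {
	cycles=$1
	delay=$2
	while [ "$cycles" -gt 0 ]
	do
		sleep "$delay"
		${XSET:-xset} dpms force off
		sleep "$delay"
		${XSET:-xset} dpms force on
		cycles=$(( cycles - 1 ))
	done
}

# Dry run: print the commands instead of touching the display.
XSET=echo
dpms_cycle 1 0
```

For a real run alongside the test script, unset XSET and call, for example, dpms_cycle 100 10.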


At this point I believe this is most likely a kernel, BIOS, or hardware bug, and it needs investigation by someone with kernel expertise.  I haven't seen any evidence that it's related to Mesa (which is where my expertise lies).  So I'm reassigning to intel-gfx-bugs@lists.freedesktop.org.
Comment 32 Chris Wilson 2013-06-29 23:08:58 UTC
For me, disabling hiz still makes the difference between the script dying after 1 cycle (after the first kill, i.e. less than a minute) and running for an indeterminate period.
Comment 33 Stefan Dirsch 2013-07-01 08:13:47 UTC
Chris, I'm afraid I need to ask again. :-( Your test results are with 'sna' or 'UXA'?
Comment 34 Chris Wilson 2013-07-01 08:19:09 UTC
As before, unsurprisingly, the failures I see are with both UXA and SNA.
Comment 35 Stefan Dirsch 2013-07-01 09:15:50 UTC
It wasn't obvious to me, since you didn't mention it before and I'm aware that SNA development is one of your main tasks at Intel. I'm sorry. But this means that Paul Berry and I, on one side, and you, on the other, are using either a different software stack or different hardware. Oh well ...
Comment 36 Ben Widawsky 2013-07-01 17:41:06 UTC
I hit something running this over the weekend. Somehow I ended up with the giant octopus screensaver running when it hung. Nothing in netconsole; the machine seems totally dead.

The last log message I had was at 50m in, about mce events (I get a bunch of thermal MCE events always). However, I know the test was running for longer than that before I left.

If someone out there can reproduce this fairly quickly, on a hunch, can you try lowering the max GPU freq to the rp1 value? Stefan?
Comment 37 Ben Widawsky 2013-07-01 19:23:46 UTC
On my board, this appears to be an unhandled MCE, which I've determined via the CATERR signal on the motherboard. Chris, can you confirm if this is what you see too?
Comment 38 Ben Widawsky 2013-07-01 19:46:57 UTC
Hmm. I just hit the hang, much quicker, and without CATERR.
Comment 39 Ben Widawsky 2013-07-01 19:50:24 UTC
(In reply to comment #38)
> Hmm. I just hit the hang, much quicker, and without CATERR.

I guess I hit send too soon. What actually happens is CATERR lights up about a minute after the hang. I'm thinking this is because the MCE handler cannot run after the system is hard hung.
Comment 40 Ben Widawsky 2013-07-01 22:51:54 UTC
So I've hooked up an ITP now, and not sure if it's related, but 3 times in a row my machine has powered down instead of hanging.

To me this really smells like a thermal issue.
Comment 41 Chris Wilson 2013-07-02 08:09:32 UTC
It does also fail eventually with hiz disabled. Even with max_freq=RP1, it fails very very quickly with hiz enabled, and eventually without.
Comment 42 Ben Widawsky 2013-07-03 02:56:47 UTC
I too was able to hit it at min frequency. I have punted this one to internal HW/validation teams.

If anyone has any surefire ways to either reduce the complexity of the test, or reduce the time to failure, please let me know. Chris' HiZ trick doesn't work on my machine.
Comment 43 Ben Widawsky 2013-07-09 18:24:57 UTC
Running intel-gpu-top + kde, I can now hit it quite fast:

#!/bin/bash

# demonstration of X server / OpenGL lockup on SLED SP3 Beta 2
# with Haswell graphics.  X server needs to be using "intel" driver.

export LIBGL_DRIVERS_PATH=/home/test/mesa/lib/
prog1=hypertorus
prog2=antmaze
prog3=phosphor

konsole --noclose -e sudo ./intel-gpu-tools/tools/intel_gpu_top

cd /usr/lib64/xscreensaver

vblank_mode=0 ./$prog1 -geometry 1280x1024+10+10 -delay 0 &

sleep 5

vblank_mode=0 ./$prog2 -geometry 1024x768+25+25 -delay 0 &
pid1=$!

mode=0
while true
do
	if [ $mode -eq 0 ]
	then
		vblank_mode=0 ./$prog3 -geometry 1024x768+425+25 -delay 0 &
		pid2=$!
	else
		vblank_mode=0 ./$prog2 -geometry 1024x768+25+25 -delay 0 &
		pid2=$!
	fi
	mode=$(( ( $mode + 1 ) % 2  ))
	# echo $mode

	sleep 10
	kill $pid1
	pid1=$pid2
done
Comment 44 Chris Wilson 2013-07-12 00:08:57 UTC
Created attachment 82350 [details] [review]
serialise most register access

A hack for testing
Comment 46 Takashi Iwai 2013-07-17 13:36:57 UTC
The serialization patch in comment 44 seems to help.  The tests have gone well so far.

However, with backporting to stable kernels in mind (yes, we need it!), the patch series in comment 45 isn't appropriate.  The code juggling makes the patches quite hard to backport.

Could this instead be done as a simple fix first, one that applies easily to older kernels, followed by the cleanup / code shuffle patch series, please?
Comment 47 Chris Wilson 2013-07-17 13:49:35 UTC
The patch https://bugs.freedesktop.org/attachment.cgi?id=82350 is what I would suggest for a backport; it will fix almost all likely issues. But you really do need to juggle the code around in order to serialise all register access (even if you do not move it into a separate file).
Comment 48 fangxun 2013-07-18 07:30:07 UTC
I tried the patch series in comment 45 with the latest drm-intel-fixes (7dcd2677e) on our Haswell laptop; no hang occurred after running for about 1 hour.
I also tested the same kernel without these patches; it hangs within a minute.
Comment 49 Chris Wilson 2013-07-20 09:07:19 UTC
Step 1:

commit a7cd1b8fea2f341b626b255d9898a5ca5fabbf0a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 19 20:36:51 2013 +0100

    drm/i915: Serialize almost all register access
Comment 50 Ben Widawsky 2013-07-26 21:39:01 UTC
Stefan, Fangxun, how is it going?
Comment 51 Daniel Vetter 2013-07-26 22:14:04 UTC
The duct-tape is now in -fixes, and just recently I've merged the full solution from Chris to dinq. So I think we can close this.

Thanks for reporting this issue and please reopen if it blows up again.
Comment 52 fangxun 2013-07-29 07:37:02 UTC
It works fine with the latest drm-next-fixes kernel (61c254) on the Haswell laptop.
Comment 53 Paul Neumann 2013-07-29 19:17:17 UTC
I am on drm-intel-next-queued (fae5cbf to be precise) and can still hang my machine very reliably running the Ben's script in Comment 43. Am I just on the wrong branch or is this not entirely fixed?

For the record, I have a desktop GT2 (on the i7-4770):
$ lspci -nn | grep VGA
00:02.0 VGA compatible controller [0300]: Intel Corporation Xeon E3-1200 v3/4th Gen Core Processor Integrated Graphics Controller [8086:0412] (rev 06)
Comment 54 Daniel Vetter 2013-08-04 21:41:41 UTC
(In reply to comment #53)
> I am on drm-intel-next-queued (fae5cbf to be precise) and can still hang my
> machine very reliably running the Ben's script in Comment 43. Am I just on
> the wrong branch or is this not entirely fixed?
> 
> For the record, I have a desktop GT2 (on the i7-4770):
> $ lspci -nn | grep VGA
> 00:02.0 VGA compatible controller [0300]: Intel Corporation Xeon E3-1200
> v3/4th Gen Core Processor Integrated Graphics Controller [8086:0412] (rev 06)

Running intel-gpu-top concurrently with the i915.ko kernel driver is known to kill the system. If you can still crash your system without that, then that's something entirely different.
Comment 55 Ben Widawsky 2013-08-05 07:18:34 UTC
(In reply to comment #53)
> I am on drm-intel-next-queued (fae5cbf to be precise) and can still hang my
> machine very reliably running the Ben's script in Comment 43. Am I just on
> the wrong branch or is this not entirely fixed?
> 
> For the record, I have a desktop GT2 (on the i7-4770):
> $ lspci -nn | grep VGA
> 00:02.0 VGA compatible controller [0300]: Intel Corporation Xeon E3-1200
> v3/4th Gen Core Processor Integrated Graphics Controller [8086:0412] (rev 06)

The script I posted is now known to exercise a HW bug.

http://lists.freedesktop.org/archives/intel-gfx/2013-July/030188.html

I am in favor of disabling GPU top for HSW without a special parameter FWIW
Comment 56 Kenneth Graunke 2013-08-06 02:52:12 UTC
I'm not a fan of disabling it.  intel_gpu_top has been rock solid on my Haswell, at least...much better than on Ivybridge...
Comment 57 Paul Neumann 2013-08-06 15:02:27 UTC
I see. Without intel_gpu_top, I have not seen any hangs so far, neither with the test case nor with just using the machine (whereas it would not survive two days of uptime without the fixes).

Thanks for the clarification.

