Bugzilla – Bug 37572
[sandybridge lockup] hard lockup with semaphores on
Last modified: 2011-10-07 23:16:52 UTC
While trying various things to get rid of bug 37568, I also tried intel-drm-next on my sandybridge machine. It locks the machine hard: the system completely freezes, the network is dead, and even pressing the power button for > 4 secs does not work. This is not instantly reproducible, but happens after 1 - 10 minutes of working under gnome3.
-- chipset: i5-2400 (HD 2000 gfx), Intel Corporation Cougar Point mobo
-- system architecture: 64-bit
-- xf86-video-intel: xorg-x11-drv-intel-2.15.0-3.fc15.x86_64
-- xserver: xorg-x11-server-Xorg-1.10.1-14.fc15.x86_64
-- mesa: git 5af46e836073d2112b147b524e441bdb808cc128
-- libdrm: libdrm-2.4.25-1.fc15.x86_64
-- kernel: tried with 220.127.116.11-28.rc1.fc15.x86_64, 2.6.39-1.fc16.x86_64,
-- Linux distribution: Fedora 15
-- Machine or mobo model: FUJITSU mobo: D3071-S1
-- Display connector: DVI (1920x1200@60)
Created attachment 47133
Created attachment 47134
Does echo 0 > /sys/module/i915/parameters/reset make any difference?
(In reply to comment #3)
> Does echo 0 > /sys/module/i915/parameters/reset make any difference?
I'm afraid it does not help. I did learn something interesting though: this only happens when semaphores are on. If they are off I don't get the hard lockup (I don't get any lockup at all, other than the hiccups caused by the missed interrupts discussed in bug 37568).
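For anyone comparing runs with semaphores on and off, it may help to record the i915 module parameters in effect before each test. The following is only a rough sketch: `i915_param` and the `SYSFS_PARAMS` override are hypothetical names made up for illustration, and the `semaphores` parameter name is an assumption about kernels of this era (only the `reset` parameter path is quoted in this bug).

```shell
# Hypothetical helper: print the current value of an i915 module parameter.
# SYSFS_PARAMS is an illustrative override; the default directory is the one
# quoted earlier in this bug (/sys/module/i915/parameters).
i915_param() {
    params_dir="${SYSFS_PARAMS:-/sys/module/i915/parameters}"
    if [ -r "$params_dir/$1" ]; then
        cat "$params_dir/$1"
    else
        # Parameter file missing or unreadable on this kernel.
        echo "unknown"
    fi
}
```

Calling `i915_param reset` and `i915_param semaphores` before a test run, and pasting the output into the comment, should make it clearer which configuration a given hang was seen with.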
I hit a crash which feels similar with 2.6.39-1.fc16.x86_64 today, with semaphores on. The difference between the crash on 2.6.39-1.fc16.x86_64 and the one on intel-drm-next is that with 2.6.39-1.fc16.x86_64 the system is still somewhat alive. I can ssh in and dmesg shows:
6299] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
6311] [drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
9952] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
9963] [drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
0961] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
0972] [drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
This goes on forever. So I did a:
echo 1 > /sys/class/drm/card0/device/reset
And that broke the loop; dmesg now said:
6393] [drm:i915_wait_request] *ERROR* something (likely vbetool) disabled interrupts, re-enabling
8678] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
8689] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
3721] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 515195 at 512777, next 515196)
I've put the error state from after these messages here:
Note that after the reset, ssh still worked, but my display remained black, and I had to reset the machine.
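When the machine is still reachable over ssh, the repeating pattern in the dmesg excerpts above can be summarized rather than pasted in full. This is a minimal sketch, assuming the message tags appear exactly as in the excerpts (`[drm:i915_hangcheck_elapsed]` and `[drm:kick_ring]`); `summarize_hangs` is a hypothetical name.

```shell
# Hypothetical helper: count i915 hangcheck and kick_ring messages in a
# kernel log read from stdin. The patterns are taken from the dmesg lines
# quoted in this bug; other kernels may word these messages differently.
summarize_hangs() {
    awk '
        /\[drm:i915_hangcheck_elapsed\]/ { hang++ }
        /\[drm:kick_ring\]/              { kick++ }
        END { printf "hangcheck=%d kick_ring=%d\n", hang, kick }
    '
}
```

Something like `dmesg | summarize_hangs` would then give one line showing how many times the hangcheck fired and how often the stuck semaphore was kicked.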
I hit this again today with what will become 2.6.40 / 3.0, using Linus' latest tree, git 5b2fad064c74265d53750931094212afb791f75e.
I have the feeling that this might be related to bug 37621, except that with 2.6.39 the system successfully recovers, whereas with intel-drm-next / 2.6.40rc it hangs hard instead.
Were there any improvements with the latest intel linux graphics stack and 3.0.x kernels?
(In reply to comment #7)
> Were there any improvements with the latest intel linux graphics stack and
> 3.0.x kernels?
I've not tried running with semaphores on for a long time now, so I don't know.
Could you retry with latest kernel and mesa and with semaphores enabled, and see if the issue still happens on your system?
If the issue is gone, we'd like to have semaphores enabled by default, as it fixes several other GPU hangs.
Also, could you please verify whether you have VT-d enabled in the BIOS, and retest with it disabled?
We have found that it influences many of the semaphore-related issues; perhaps that is the case here?
I've run for one hour+ with 3.1.0-rc8 and mesa master (d742a64909b2b414fc94b6f525a13ce09ca7f9f7) both with and without VT-d enabled, and in both cases I've experienced no hang. So it seems that this issue is fixed.
If I do hit a hang in the coming few days I'll update this bug. Note that when I disabled VT-d, I also had to disable the x2apic in the BIOS, as with VT-d disabled and the x2apic enabled 3.1.0-rc8 would not boot.
I believe that in the past I've tried with the x2apic both enabled and disabled (and VT-d always enabled) and that did not make a difference: I got the hang with semaphores on independent of the x2apic setting.
Feel free to close this (unless you want more info / want to wait a bit).
Closing until the issue reappears ;-)
Looks similar to bug 36652; both disappear with the new kernel.
*** This bug has been marked as a duplicate of bug 36652 ***