Bug 37572 - [sandybridge lockup] hard lockup with semaphores on
Status: RESOLVED DUPLICATE of bug 36652
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: high normal
Assigned To: Chris Wilson
Reported: 2011-05-25 03:53 UTC by Hans de Goede
Modified: 2011-10-07 23:16 UTC

xorg.log (62.72 KB, text/plain)
2011-05-25 03:54 UTC, Hans de Goede
dmesg (123.20 KB, text/plain)
2011-05-25 03:54 UTC, Hans de Goede

Description Hans de Goede 2011-05-25 03:53:40 UTC

While trying various things to get rid of bug 37568, I also tried drm-intel-next on my Sandy Bridge machine. It locks the machine hard within about 10 minutes of use: the system freezes completely, the network is dead, and even pressing the power button for more than 4 seconds does not work. This is not instantly reproducible, but it happens after 1 to 10 minutes of working under GNOME 3.

System environment:
-- chipset: i5-2400 (HD 2000 gfx), Intel Corporation Cougar Point mobo
-- system architecture: 64-bit
-- xf86-video-intel: xorg-x11-drv-intel-2.15.0-3.fc15.x86_64
-- xserver: xorg-x11-server-Xorg-1.10.1-14.fc15.x86_64
-- mesa: git 5af46e836073d2112b147b524e441bdb808cc128
-- libdrm: libdrm-2.4.25-1.fc15.x86_64
-- kernel: tried with 2.6.39-1.fc16.x86_64 and
           drm-intel-next 9e3c256d7d56a12a3242222945ce8e6347f93fa0
-- Linux distribution: Fedora 15
-- Machine or mobo model: FUJITSU mobo: D3071-S1
-- Display connector: DVI (1920x1200@60)


Comment 1 Hans de Goede 2011-05-25 03:54:04 UTC
Created attachment 47133
Comment 2 Hans de Goede 2011-05-25 03:54:31 UTC
Created attachment 47134
Comment 3 Chris Wilson 2011-05-25 08:15:58 UTC
Does echo 0 > /sys/module/i915/parameters/reset make any difference?
Comment 4 Hans de Goede 2011-05-25 12:42:58 UTC
(In reply to comment #3)
> Does echo 0 > /sys/module/i915/parameters/reset make any difference?

I'm afraid it does not help. I did learn something interesting, though: this only happens when semaphores are on. If they are off, I don't get the hard lockup (I don't get any lockup at all, other than the hiccups caused by the missed interrupts discussed in bug 37568).
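
For anyone reproducing this, here is a minimal sketch for checking the current semaphore setting before a test run. It assumes the driver exposes a parameter named semaphores in the same /sys/module/i915/parameters directory mentioned in comment #3; that parameter name, the i915_param helper, and the PARAM_DIR override are illustrative assumptions and are not confirmed anywhere in this report.

```shell
#!/bin/sh
# Print the value of an i915 module parameter read from sysfs.
# PARAM_DIR defaults to the directory used in comment #3 and can be
# overridden for testing. The parameter name "semaphores" used below is
# an assumption; check which files actually exist on your kernel.
i915_param() {
    dir="${PARAM_DIR:-/sys/module/i915/parameters}"
    if [ -r "$dir/$1" ]; then
        cat "$dir/$1"
    else
        # Parameter file missing or unreadable on this kernel.
        echo "unavailable"
    fi
}

i915_param semaphores
```

If the helper prints "unavailable", the running kernel does not expose a parameter under that name and the setting has to be checked some other way.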
Comment 5 Hans de Goede 2011-05-26 01:49:55 UTC
Today I hit a crash that feels similar with 2.6.39-1.fc16.x86_64, with semaphores on. The difference between the crash on 2.6.39-1.fc16.x86_64 and the one on drm-intel-next is that with 2.6.39-1.fc16.x86_64 the system is still somewhat alive. I can ssh in, and dmesg shows:

6299] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
6311] [drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
9952] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
9963] [drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring
0961] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
0972] [drm:kick_ring] *ERROR* Kicking stuck semaphore on blt ring

This goes on forever, so I ran:
echo 1 > /sys/class/drm/card0/device/reset

That broke the loop; dmesg now said:

6393] [drm:i915_wait_request] *ERROR* something (likely vbetool) disabled interrupts, re-enabling
8678] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung 
8689] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
3721] [drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 515195 at 512777, next 515196)

I've put the error state from after these messages here:

Note that after the reset, ssh still worked, but my display remained black, and I had to reset the machine.
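
The repeating hangcheck / kick_ring pattern in the first dmesg excerpt above can be tallied from a saved log with something like the following sketch. The count_hangs helper name is made up for illustration; the two patterns are taken from the dmesg lines quoted in this comment and may not match messages from other kernel versions.

```shell
#!/bin/sh
# Count hangcheck and stuck-semaphore messages in a kernel log on stdin.
# The patterns mirror the messages quoted in this bug report.
count_hangs() {
    awk '
        /i915_hangcheck_elapsed.*GPU hung/ { hung++ }
        /kick_ring.*stuck semaphore/       { kicked++ }
        END { printf "hangcheck=%d kicked=%d\n", hung, kicked }
    '
}

# Usage: dmesg | count_hangs
```

Equal, steadily growing counts for both messages would match the loop described here, where every hangcheck is followed by a kick of the blt ring.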
Comment 6 Hans de Goede 2011-05-28 01:18:21 UTC
I hit this again today with what will become 2.6.40 / 3.0, using Linus' latest tree (git 5b2fad064c74265d53750931094212afb791f75e).

I have the feeling that this might be related to bug 37621, except that with 2.6.39 the system successfully recovers, whereas with drm-intel-next / 2.6.40rc it hangs hard instead.
Comment 7 Eugeni Dodonov 2011-08-22 12:21:42 UTC
Were there any improvements with the latest intel linux graphics stack and 3.0.x kernels?
Comment 8 Hans de Goede 2011-08-26 04:49:03 UTC
(In reply to comment #7)
> Were there any improvements with the latest intel linux graphics stack and
> 3.0.x kernels?

I've not tried running with semaphores on for a long time now, so I don't know.
Comment 9 Eugeni Dodonov 2011-09-29 10:58:27 UTC
Could you retry with the latest kernel and Mesa, with semaphores enabled, and see if the issue still happens on your system?

If the issue is gone, we'd like to have semaphores enabled by default, as that fixes several other GPU hangs.
Comment 10 Eugeni Dodonov 2011-09-29 14:09:44 UTC
Also, could you please verify whether you have VT-d enabled in the BIOS, and retest with it disabled?

We have found that it influences many of the semaphore-related issues; perhaps that is the case here?

Comment 11 Hans de Goede 2011-09-30 07:02:10 UTC
I've run for one hour+ with 3.1.0-rc8 and mesa master (d742a64909b2b414fc94b6f525a13ce09ca7f9f7) both with and without VT-d enabled, and in both cases I've experienced no hang. So it seems that this issue is fixed. 

If I do hit a hang in the coming few days, I'll update this bug. Note that when I disabled VT-d, I also had to disable the x2apic in the BIOS, as with VT-d disabled and the x2apic enabled 3.1.0-rc8 would not boot.

I believe that in the past I've tried with the x2apic both enabled and disabled (and VT-d always enabled) and that did not make a difference: I got the hang with semaphores on independent of the x2apic setting.

Feel free to close this (unless you want more info / want to wait a bit).
Comment 12 Chris Wilson 2011-09-30 08:58:14 UTC
Closing until the issue reappears ;-)
Comment 13 Gordon Jin 2011-10-07 23:16:52 UTC
Looks similar to bug 36652; both disappear with the new kernel.

*** This bug has been marked as a duplicate of bug 36652 ***