Created attachment 15005: standard signed-off kernel patch to fix intel driver lockup on buffer swap

Running an OpenGL application which attempts to synchronize buffer swapping to the vertical sync event eventually results in a hard lockup of the Linux kernel. When this happens, even the magic sysrq key won't show any signs of life. The lockup takes anywhere from 5 to 90 minutes of running to appear, and you need to be running dual-head cloned mode for it to happen at all. The root cause is not related to dual head, but I am guessing the less predictable interrupt rate when running this way exacerbates the underlying race condition, raising the probability of failure.

A standard signed-off kernel patch which fixes the problem is attached to this bug. I am not a git expert, so I apologize that I don't have a git changeset built against the drm git repository. (I did, however, check that this problem also still appears to be present in the git repository.) It should be trivial to apply this patch in any case. This patch was built against the 2.6.24.3 vanilla kernel source tree.

Following are the comments from the attached patch:

The i915_vblank_swap() function schedules an automatic buffer swap upon receipt of the vertical sync interrupt. Such an operation is lengthy, so it can't happen in normal interrupt context; the DRM implements this by scheduling the work in a kernel softirq-scheduled tasklet. In order for the buffer swap to work safely, the DRM's central lock must be taken, via a call to drm_lock_take() in drm_irq.c within the function drm_locked_tasklet_func(). The lock-taking logic uses a non-interrupt-blocking spinlock to implement the manipulations needed to take the lock. Note that a non-interrupt-blocking spinlock blocks kernel pre-emption and atomically sets a flag, but interrupts are still enabled.

This semantic is safe if ALL attempts to use the spinlock only happen from process context. However, this buffer swap happens from softirq context, which is really a form of interrupt context that WILL pre-empt execution even when normal thread pre-emption is otherwise disabled. Thus we have an unsafe situation, in that drm_locked_tasklet_func() can block on a spinlock already taken by a thread in process context which will never get scheduled again because of the blocked softirq tasklet. This wedges the kernel hard. It's a very small race condition, but a race nonetheless, with a very undesirable potential outcome.

To trigger this bug, run a dual-head cloned mode configuration which uses the i915 drm, then execute an OpenGL application which synchronizes buffer swaps against the vertical sync interrupt. In my testing, a lockup always results after running anywhere from 5 minutes to an hour and a half. I believe dual-head is needed to really trigger the problem because then the vertical sync interrupt handling is no longer predictable (due to being interrupt-sourced from two different heads running at different speeds). This raises the probability of the tasklet trying to run while the userspace DRI is doing things to the GPU (and manipulating the DRM lock).

The fix is to change the relevant spinlock semantics to be the interrupt-blocking form. Thus all calls for this spinlock change from spin_lock() to spin_lock_irqsave(), and similarly calls to spin_unlock() change to spin_unlock_irqrestore(). After this change I am no longer able to trigger the lockup; the longest test run so far was 6 hours (the test was stopped after that point).
Signed-off-by: Mike Isely <isely@pobox.com>
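For illustration, here is a minimal sketch of the kind of change the patch comments describe. The wrapper function name and the lock_data->spinlock field are assumptions made for the example, not a verbatim copy of the 2.6.24.3 source:

#include <linux/spinlock.h>

/*
 * Sketch only: drm_lock_state_update() is a hypothetical stand-in for
 * the DRM lock-taking path, and lock_data->spinlock stands in for the
 * spinlock guarding the heavyweight DRM lock state.
 */
static void drm_lock_state_update(struct drm_lock_data *lock_data)
{
	unsigned long flags;

	/*
	 * Before the fix this was spin_lock()/spin_unlock(), which leaves
	 * interrupts (and therefore softirqs) enabled.  The buffer-swap
	 * tasklet could then preempt a process-context holder on the same
	 * CPU and spin forever on the lock, wedging the machine.
	 */
	spin_lock_irqsave(&lock_data->spinlock, flags);

	/* ... manipulate the heavyweight DRM hardware lock state ... */

	spin_unlock_irqrestore(&lock_data->spinlock, flags);
}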
Dave pushed the fix to drm Git and submitted it for 2.6.25.
I think this fix is partially incorrect. The idea of tasklets is that you should be able to run a "bottom half" of the IRQ handler with hard IRQs enabled. Data that is shared between a hard interrupt handler and a tasklet or a normal process should use spin_lock_irqsave(), whereas data shared only between a tasklet and a normal process should use spin_lock_bh().

So in this case, I think we should be using spin_lock_bh() to avoid unnecessarily disabling hard IRQs.

/Thomas
Interesting. I was unaware of spin_lock_bh(), but having looked at it now I agree; if I had seen it before, I would have used it. For what it's worth, though, I didn't use spin_lock_irqsave() without some care. This spinlock is only ever taken for very short, reasonably deterministic, non-blocking intervals in the DRM code (as part of taking a much heavier-weight lock), so I felt that any latency impact from interrupt blockage in these cases should be negligible, and thus I didn't really look for a different (potentially more complex) solution.

-Mike
(In reply to comment #2)
> Data that is shared between a hard interrupt handler and a tasklet or a
> normal process should use spin_lock_irqsave() whereas data shared between a
> tasklet and a normal process should use spin_lock_bh()

Note that the tasklet callback function may also be called from normal process context if the tasklet couldn't acquire the DRM lock. Would spin_lock_bh() still work in that case?
(In reply to comment #4)
> Note that the tasklet callback function may also be called from normal
> process context if the tasklet couldn't acquire the DRM lock. Would
> spin_lock_bh() still work in that case?

Never mind, of course this spinlock is only used by the actual tasklet...
so did someone want to submit a patch to use _bh?
I've pushed a fix for this now. Let me know if there are any problems with it.

/Thomas