Created attachment 15005: standard signed-off kernel patch to fix intel driver lockup on buffer swap

Running an OpenGL application which attempts to synchronize buffer swapping to the vertical sync event eventually results in a hard lockup of the Linux kernel. When this happens, even the magic sysrq key won't show any signs of life. The lockup takes anywhere from 5 to 90 minutes of running to appear, and you need to be running dual-head cloned mode for it to happen at all. The root cause is not related to dual head, but I am guessing the less predictable interrupt rate when running this way exacerbates the underlying race condition, raising the probability of failure.

A standard signed-off kernel patch which fixes the problem is attached to this bug. I am not a git expert, so I apologize that I don't have a git changeset built against the drm git repository. (I did, however, check that this problem also still appears to be present in the git repository.) It should be trivial to apply this patch in any case. This patch was built against the 2.6.24.3 vanilla kernel source tree.

Following are the comments from the attached patch:

The i915_vblank_swap() function schedules an automatic buffer swap upon receipt of the vertical sync interrupt. Such an operation is lengthy, so it can't happen in normal interrupt context; the DRM implements this by scheduling the work in a kernel softirq-scheduled tasklet. In order for the buffer swap to work safely, the DRM's central lock must be taken, via a call to drm_lock_take() in drm_irq.c within the function drm_locked_tasklet_func(). The lock-taking logic uses a non-interrupt-blocking spinlock to implement the manipulations needed to take the lock. Note that a non-interrupt-blocking spinlock blocks kernel pre-emption and atomically sets a flag, but interrupts are still enabled.

This semantic is safe if ALL attempts to use the spinlock only happen from process context. However, this buffer swap happens from softirq context, which is really a form of interrupt context that WILL pre-empt execution even when normal thread pre-emption is otherwise disabled. Thus we have an unsafe situation, in that drm_locked_tasklet_func() can block on a spinlock already taken by a thread in process context which will never get scheduled again because of the blocked softirq tasklet. This wedges the kernel hard. It's a very small race condition, but a race nonetheless, with a very undesirable potential outcome.

To trigger this bug, run a dual-head cloned mode configuration which uses the i915 drm, then execute an OpenGL application which synchronizes buffer swaps against the vertical sync interrupt. In my testing, a lockup always results after running anywhere from 5 minutes to an hour and a half. I believe dual-head is needed to really trigger the problem because then the vertical sync interrupt handling is no longer predictable (due to being interrupt-sourced from two different heads running at different speeds). This raises the probability of the tasklet trying to run while the userspace DRI is doing things to the GPU (and manipulating the DRM lock).

The fix is to change the relevant spinlock semantics to be the interrupt-blocking form. Thus all calls for this spinlock change from spin_lock() to spin_lock_irqsave(), and similarly calls to spin_unlock() change to spin_unlock_irqrestore(). After this change I am no longer able to trigger the lockup; the longest test run so far was 6 hours (the test was stopped after that point).
Signed-off-by: Mike Isely <isely@pobox.com>
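For illustration, here is a minimal sketch of the kind of change the patch comments describe. The wrapper function name and the lock_data->spinlock field are assumptions made for the example, not a verbatim copy of the 2.6.24.3 source:

#include <linux/spinlock.h>

/*
 * Sketch only: drm_lock_state_update() is a hypothetical stand-in for
 * the DRM lock-taking path, and lock_data->spinlock stands in for the
 * spinlock guarding the heavyweight DRM lock state.
 */
static void drm_lock_state_update(struct drm_lock_data *lock_data)
{
	unsigned long flags;

	/*
	 * Before the fix this was spin_lock()/spin_unlock(), which leaves
	 * interrupts (and therefore softirqs) enabled.  The buffer-swap
	 * tasklet could then preempt a process-context holder on the same
	 * CPU and spin forever on the lock, wedging the machine.
	 */
	spin_lock_irqsave(&lock_data->spinlock, flags);

	/* ... manipulate the heavyweight DRM hardware lock state ... */

	spin_unlock_irqrestore(&lock_data->spinlock, flags);
}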
Dave pushed the fix to drm Git and submitted it for 2.6.25.
I think this fix is partially incorrect. The idea of tasklets is that you should be able to run a "bottom half" of the IRQ handler with hard IRQs enabled. Data that is shared between a hard interrupt handler and a tasklet or a normal process should use spin_lock_irqsave(), whereas data shared only between a tasklet and a normal process should use spin_lock_bh().

So in this case, I think we should be using spin_lock_bh() to avoid unnecessarily disabling hard IRQs.

/Thomas
Interesting. I was unaware of spin_lock_bh(), but having looked at it now I agree; if I had seen it before, I would have used it. For what it's worth, though, I didn't use spin_lock_irqsave() without some care. This spinlock is only ever taken for very short, reasonably deterministic, non-blocking intervals in the DRM code (as part of taking a much heavier-weight lock), so I felt that any latency impact from interrupt blockage in these cases should be negligible, and thus I didn't really look for a different (potentially more complex) solution.

-Mike
(In reply to comment #2)
> Data that is shared between a hard interrupt handler and a tasklet or a
> normal process should use spin_lock_irqsave() whereas data shared between a
> tasklet and a normal process should use spin_lock_bh()

Note that the tasklet callback function may also be called from normal process context if the tasklet couldn't acquire the DRM lock. Would spin_lock_bh() still work in that case?
(In reply to comment #4)
> Note that the tasklet callback function may also be called from normal
> process context if the tasklet couldn't acquire the DRM lock. Would
> spin_lock_bh() still work in that case?

Never mind, of course this spinlock is only used by the actual tasklet...
so did someone want to submit a patch to use _bh?
I've pushed a fix for this now. Let me know if there are any problems with it.

/Thomas