System Environment:
--------------------------
Arch: i386
Platform: Capella
OSD: Fedora release 13 (Goddard)
Cairo: (master)cb0bc64c16b3a38cbf0c622830c18ac9ea6e2ffe
Libdrm: (master)2.4.21-23-g81fa7a9f56b1efb04658db921e5228c102548921
Mesa: (7.9)361084ac4b16c6af59671b776b832034990766f0
Xserver: (master)xorg-server-1.9.0
Xf86_video_intel: (master)2.12.901-2-gb84925b9c0842ba4dfa3481c09d3a80f84db4838
Libva: (master)e68bb8bc8ba844f0a5c840fa47467d7056dcd85d
Kernel: (drm-intel-next)5c12a07e8073295ce8b57a822f811ac34e4f8420

Bug detailed description:
-----------------------------------------------
I narrowed it down to 5c12a07e8073295ce8b57a822f811ac34e4f8420. With this commit, running the glean test case occluQry causes a GPU hang.

The first bad commit info:
-----------------------------------------------
commit 5c12a07e8073295ce8b57a822f811ac34e4f8420
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 22 11:22:30 2010 +0100

    drm/i915: Drop ring->lazy_request

    We are not currently using it as intended, so remove the complication.

Reproduce steps:
----------------
1. xinit &
2. /GFX/Test/Glean/bin/glean -o -r test -t occluQry
oglc/occlusion_query.c, piglit/general_occlusion_query, piglit/general_occlusion-query-discard and piglit/general_timer_query also have the same issue.
Odd, can I have a look at the dmesg and /sys/kernel/debug/dri/0/i915_error_state?

A subtle cause of GPU hangs recently is:

commit 76c1dec1979d9b552aab9600eb898ccec394fbbc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Sep 25 11:22:51 2010 +0100

    drm/i915: Make the mutex_lock interruptible on ioctl paths

    ... and combine it with the wedged completion handler.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

which requires libdrm.git until the debate settles as to whether introducing more potential EINTR is an actual ABI break.
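The concern above is that an interruptible lock lets more ioctls fail with EINTR, which userspace then has to restart. libdrm's drmIoctl() handles this by looping on ioctl() while errno is EINTR or EAGAIN. A minimal sketch of that retry pattern follows; `retry_ioctl`, the `ioctl_fn` type and the two fake ioctls are hypothetical names, used only so the loop can be exercised without a GPU.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback type with ioctl(2)'s shape, so a fake can be injected. */
typedef int (*ioctl_fn)(int fd, unsigned long request, void *arg);

/*
 * Restart the call while it fails with EINTR (interrupted, e.g. by a signal
 * while waiting for a lock) or EAGAIN, mirroring the loop in libdrm's
 * drmIoctl(). Any other error is returned to the caller unchanged.
 */
static int retry_ioctl(ioctl_fn fn, int fd, unsigned long request, void *arg)
{
    int ret;

    assert(fn != NULL);
    do {
        ret = fn(fd, request, arg);
    } while (ret == -1 && (errno == EINTR || errno == EAGAIN));
    return ret;
}

/* Test doubles (hypothetical): count calls and simulate kernel results. */
static int interrupts_left;
static int calls_made;

/* Fails with EINTR `interrupts_left` times, then succeeds. */
static int fake_interruptible_ioctl(int fd, unsigned long request, void *arg)
{
    (void)fd; (void)request; (void)arg;
    calls_made++;
    if (interrupts_left > 0) {
        interrupts_left--;
        errno = EINTR;
        return -1;
    }
    return 0;
}

/* Always fails with a real error that must not be retried. */
static int fake_invalid_ioctl(int fd, unsigned long request, void *arg)
{
    (void)fd; (void)request; (void)arg;
    calls_made++;
    errno = EINVAL;
    return -1;
}
```

Real callers would simply use drmIoctl(); the point of the sketch is that a client which does not restart on EINTR would see spurious failures once the kernel lock becomes interruptible.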
(In reply to comment #2)

Oh, our description was not correct: it is not a GPU hang. The tests just get stuck waiting for ioctl(GEM_BUSY) to return. Normally, these tests return immediately.
Weirder still, other than the acquisition of the lock, i915_gem_busy_ioctl() is meant to be a non-blocking operation. Can you grab a dmesg after 'echo t > /proc/sysrq-trigger' so I can see the contention? And, if you have the opportunity, run with mutex debugging enabled in your kernel (just in case).
Created attachment 38959
dmesg when stuck at ioctl(GEM_BUSY)

This is the dmesg captured while the test was stuck in ioctl(GEM_BUSY), after running "echo t > /proc/sysrq-trigger".
One more finding: when the test is stuck, moving the mouse pointer in its window makes it continue running for as long as we keep moving the pointer.
So we have X throttling, glean waiting in busy, and missing interrupts. Hmm. Not a fix, but does the problem disappear with the tip of drm-intel-next? I'm thinking that

commit f787a5f59e1b0e320a6b0a37e9a2e306551d1e40
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 24 16:02:42 2010 +0100

    drm/i915: Only hold a process-local lock whilst throttling.

    Avoid cause latencies in other clients by not taking the global struct
    mutex and moving the per-client request manipulation a local per-client
    mutex. For example, this allows a compositor to schedule a page-flip
    (through X) whilst an OpenGL application is monopolising the GPU.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

will hide the contention, and leave glean stuck all by itself...
The problem still appears with commit f787a5f59e1b0e320a6b0a37e9a2e306551d1e40 and with the newest commit on drm-intel-next, 1c25595f8d31392b8c36b54c624d01591dbfb87b. I just compiled the drm modules at the above commit and loaded them with insmod.
I'm still not sure what causes the busy ioctl to hang, but dropping the lazy ring request was buggy in its own right:

commit a56ba56c275b1c2b982c8901ab92bf5a0fd0b757
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Sep 28 10:07:56 2010 +0100

    Revert "drm/i915: Drop ring->lazy_request"

    With multiple rings generating requests independently, the outstanding
    requests must also be track independently.

    Reported-by: Wang Jinjin <jinjin.wang@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=30380
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

But I haven't worked out the causal link between that and the hang, so please check.
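The revert's rationale, that each ring's outstanding (lazy) request must be tracked separately, can be illustrated with a small model. This is a sketch under stated assumptions: `struct ring`, `struct device`, `ring_next_request_seqno()` and `ring_emit_lazy_request()` are hypothetical simplifications, not the i915 driver's real structures. It shows that emitting the deferred request on one ring leaves another ring's pending request untouched, which a single device-wide flag could not guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_RINGS 3 /* illustrative: e.g. render, BSD and blitter rings */

/* Hypothetical, simplified model; not the i915 driver's real structures. */
struct ring {
    bool lazy_request;      /* a seqno was handed out but no request emitted */
    uint32_t lazy_seqno;    /* the seqno promised to work queued on this ring */
    uint32_t emitted_seqno; /* last seqno actually emitted on this ring */
};

struct device {
    uint32_t next_seqno;    /* device-wide seqno allocator */
    struct ring rings[NUM_RINGS];
};

/* Reserve a seqno for new work on @ring, deferring the request itself. */
static uint32_t ring_next_request_seqno(struct device *dev, struct ring *ring)
{
    assert(dev != NULL && ring != NULL);
    if (!ring->lazy_request) {
        ring->lazy_seqno = dev->next_seqno++;
        ring->lazy_request = true;
    }
    return ring->lazy_seqno;
}

/* Emit @ring's deferred request, if any; other rings are left untouched. */
static void ring_emit_lazy_request(struct ring *ring)
{
    if (ring->lazy_request) {
        ring->emitted_seqno = ring->lazy_seqno;
        ring->lazy_request = false;
    }
}
```

With one shared flag instead of `ring->lazy_request`, emitting the request on the render ring would also "clear" the BSD ring's pending state, so objects waiting on the BSD seqno would never see it emitted.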
Jinjin, please retest, either by reverting the commit yourself or by testing the revert patch on drm-intel-staging.
I found commit a56ba56c275b1c2b982c8901ab92bf5a0fd0b757 on both drm-intel-next and drm-intel-staging, so I tested with Kernel: (drm-intel-next)2d7b8366ae4a9ec2183c30e432a4a9a495c82bcd. The problem still occurs as before.
Jinjin, thanks for the confirmation.
I believe this

commit de18a29e0fa3904894b4e02fae0e712cd43f740c
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sat Nov 27 22:30:41 2010 +0100

    drm/i915: fix regression due to ba3d8d749b01548b9

    We don't track gpu flush request in any special way. So even with
    obj->write_domain == 0, a gpu flush might be outstanding but no
    yet executed. Even worse, the latest request might use the object
    only for reading. So and unconditional call to object_wait_rendering
    is needed for !pipelined.

    Hence revert that patch fully and untangle the flushing from the
    synchronization again.

    Reported-by: Keith Packard <keithp@keithp.com>
    Tested-by: Keith Packard <keithp@keithp.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

should have cleared up this bug.
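The commit's argument, that `obj->write_domain == 0` does not prove the GPU is done with an object, can be shown with a simplified model. This is a sketch under assumptions: `struct gem_object`, `must_wait_shortcut()` and `must_wait_unconditional()` are hypothetical names, and the shortcut is only a model of the skipped wait the commit message describes, not the code of ba3d8d749b01548b9. The seqno comparison is modeled on the wraparound-safe i915_seqno_passed() helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a GEM object; field names echo the commit text. */
struct gem_object {
    uint32_t write_domain;         /* 0: no GPU write domain pending */
    uint32_t last_rendering_seqno; /* last request using it (read or write) */
};

/* Wraparound-safe "has @seqno completed, given @completed?" check. */
static bool seqno_passed(uint32_t completed, uint32_t seqno)
{
    return (int32_t)(completed - seqno) >= 0;
}

/* Modeled shortcut: skip the wait whenever nothing is being written. */
static bool must_wait_shortcut(const struct gem_object *obj, uint32_t completed)
{
    return obj->write_domain != 0 &&
           !seqno_passed(completed, obj->last_rendering_seqno);
}

/* Modeled fix for !pipelined: always wait for the object's last request. */
static bool must_wait_unconditional(const struct gem_object *obj,
                                    uint32_t completed)
{
    assert(obj != NULL);
    return !seqno_passed(completed, obj->last_rendering_seqno);
}
```

For an object last used only for reading (write domain 0) by a request that has not yet completed, the shortcut reports that no wait is needed, while the unconditional check correctly waits.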
(In reply to comment #13)

I find the problem still exists with the newest kernel on the drm-intel-next branch. I tested with Kernel: (drm-intel-next)5aa7d52aebfc11760bbc5b081ed621227bb77981
D'oh. Of course.

Fixed on drm-intel-staging, just waiting on an ack for another patch before committing to -fixes and merging into -next (since it conflicts badly).

commit c2edf2748b45d6a40d30b962fac8721f24b9af70
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Dec 7 10:38:40 2010 +0000

    drm/i915: Emit a request to clear an flushed and idle ring for busy bo

    In order for bos to retire eventually, a request must be sent down the
    ring. This is expected, for example, by occlusion queries for which mesa
    will wait upon (whilst running glean) before issuing more batches and so
    the normal activity upon the ring is suspended and we need to emit a
    request to clear the idle ring.

    Reported-by: Jinjin, Wang <jinjin.wang@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=30380
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
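The deadlock the fix addresses can be reproduced in a toy model: a client polls the busy query and issues no further work until the bo is idle, but the bo can only retire once a request carrying its seqno has been sent down the ring. This is a minimal sketch under assumptions; `struct ring`, `struct bo`, `busy_query()`, `gpu_execute()` and `poll_until_idle()` are hypothetical names, and the GPU is reduced to "complete everything already emitted".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one ring and one buffer object (bo). */
struct ring {
    uint32_t emitted_seqno;   /* last seqno sent down the ring as a request */
    uint32_t completed_seqno; /* last seqno the GPU has reported complete */
};

struct bo {
    bool active;                   /* still waiting to be retired */
    uint32_t last_rendering_seqno; /* seqno it must see completed to retire */
};

static bool seqno_passed(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/* Stand-in for the GPU: it completes everything already emitted. */
static void gpu_execute(struct ring *ring)
{
    ring->completed_seqno = ring->emitted_seqno;
}

/* Model of the busy query. With @emit_idle_request, a bo whose seqno was
 * never emitted gets a request sent down the ring so it can retire. */
static bool busy_query(struct ring *ring, struct bo *bo, bool emit_idle_request)
{
    assert(ring != NULL && bo != NULL);
    if (bo->active && seqno_passed(ring->completed_seqno, bo->last_rendering_seqno))
        bo->active = false; /* retire */
    if (bo->active && emit_idle_request &&
        !seqno_passed(ring->emitted_seqno, bo->last_rendering_seqno))
        ring->emitted_seqno = bo->last_rendering_seqno; /* emit request */
    return bo->active;
}

/* Client that waits on its query result before issuing more batches.
 * Returns the number of polls taken, or -1 after @max_polls. */
static int poll_until_idle(struct ring *ring, struct bo *bo,
                           bool emit_idle_request, int max_polls)
{
    for (int i = 1; i <= max_polls; i++) {
        if (!busy_query(ring, bo, emit_idle_request))
            return i;
        gpu_execute(ring);
    }
    return -1;
}
```

Without the emitted request the poll never terminates; with it, the bo retires on the next poll. In this model, a request emitted by any other client on the same ring would also let the bo retire, which may explain why moving the mouse pointer (comment #6) let the test continue, though the thread does not establish that link.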
(In reply to comment #15)

It now works well with the newest code on the drm-intel-staging branch, which includes commit c2edf2748b45d6a40d30b962fac8721f24b9af70 plus the one commit ahead of it. Now waiting for it to be applied to the -fixes and -next branches.
Kernel: (drm-intel-staging)e1c7e8c08a30f39ccb5e473e58edf94adb07a853
Now it works well on both the drm-intel-next and drm-intel-fixes branches.
Kernel: (drm-intel-fixes) 63abf3edaf42d0b9f278df90fe41c7ed4796b6b1
Kernel: (drm-intel-next) 8d5203ca62539c6ab36a5bc2402c2de1de460e30
Closing old verified+fixed.