30380 – [drm-intel-next bisected] glean test case occluQry stuck at ioctl(GEM_BUSY)

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	All Linux (All)

Importance:	high major
Assignee:	Chris Wilson
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2010-09-25 20:39 UTC by wang,jinjin
Modified:	2017-09-04 10:02 UTC
CC List:	1 user

See Also:
i915 platform:
i915 features:

Attachments
dmesg when stuck at ioctl(GEM_BUSY) (119.80 KB, text/plain) 2010-09-26 01:23 UTC, Shuang He

Description wang,jinjin 2010-09-25 20:39:06 UTC

System Environment:
--------------------------
Arch:           i386
Platform:       Capella
OSD:            Fedora release 13 (Goddard)
Cairo:          (master)cb0bc64c16b3a38cbf0c622830c18ac9ea6e2ffe
Libdrm:         (master)2.4.21-23-g81fa7a9f56b1efb04658db921e5228c102548921
Mesa:           (7.9)361084ac4b16c6af59671b776b832034990766f0
Xserver:                (master)xorg-server-1.9.0
Xf86_video_intel:               (master)2.12.901-2-gb84925b9c0842ba4dfa3481c09d3a80f84db4838
Libva:          (master)e68bb8bc8ba844f0a5c840fa47467d7056dcd85d
Kernel: (drm-intel-next)5c12a07e8073295ce8b57a822f811ac34e4f8420

Bug detailed description:
-----------------------------------------------
I bisected it down to 5c12a07e8073295ce8b57a822f811ac34e4f8420. With this commit, running the glean test case occluQry results in a GPU hang.

The first bad commit info:
-----------------------------------------------
commit 5c12a07e8073295ce8b57a822f811ac34e4f8420
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 22 11:22:30 2010 +0100

    drm/i915: Drop ring->lazy_request

    We are not currently using it as intended, so remove the complication.


Reproduce steps:
----------------
1. xinit &
2. /GFX/Test/Glean/bin/glean -o -r test -t occluQry

Comment 1 Shuang He 2010-09-25 22:10:09 UTC

oglc/occlusion_query.c
piglit/general_occlusion_query, piglit/general_occlusion-query-discard
piglit/general_timer_query

These tests also have the same issue.

Comment 2 Chris Wilson 2010-09-26 00:30:51 UTC

Odd, can I have a look at the dmesg and /sys/kernel/debug/dri/0/i915_error_state?

A subtle cause of GPU hangs recently is:


commit 76c1dec1979d9b552aab9600eb898ccec394fbbc
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Sep 25 11:22:51 2010 +0100

    drm/i915: Make the mutex_lock interruptible on ioctl paths
    
    ... and combine it with the wedged completion handler.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

which requires libdrm.git until the debate settles as to whether introducing more potential EINTR returns is an actual ABI break.
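
For context on the EINTR concern: once the ioctl paths take the mutex interruptibly, a signal can make an ioctl return -1 with errno set to EINTR, and a caller that does not retry sees a spurious failure. Below is a minimal sketch of the usual userspace retry loop. It is not copied from libdrm; `retry_ioctl`, `ioctl_fn` and `fake_interrupted_ioctl` are hypothetical names, and the fake ioctl only stands in for the real system call so the loop can be exercised without a DRM device.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the signature of ioctl(fd, request, arg). */
typedef int (*ioctl_fn)(int fd, unsigned long request, void *arg);

/* Call do_ioctl again for as long as it fails with EINTR. */
static int retry_ioctl(ioctl_fn do_ioctl, int fd, unsigned long request,
                       void *arg)
{
    int ret;

    assert(do_ioctl != NULL);
    do {
        ret = do_ioctl(fd, request, arg);
    } while (ret == -1 && errno == EINTR);

    return ret;
}

/* State for the fake ioctl below (illustration only). */
static int interruptions_left = 0;
static int calls_made = 0;

/* Fails with EINTR interruptions_left times, then succeeds with 0. */
static int fake_interrupted_ioctl(int fd, unsigned long request, void *arg)
{
    (void)fd;
    (void)request;
    (void)arg;

    calls_made++;
    if (interruptions_left > 0) {
        interruptions_left--;
        errno = EINTR;
        return -1;
    }
    return 0;
}
```

With `interruptions_left` set to 3, `retry_ioctl(fake_interrupted_ioctl, -1, 0, NULL)` calls the fake ioctl four times and then returns 0. A caller without such a loop would have reported an error on the first interrupted call.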

Comment 3 Shuang He 2010-09-26 00:36:20 UTC

(In reply to comment #2)
> Odd, can I have a look at the dmesg and
> /sys/kernel/debug/dri/0/i915_error_state?
> 
> A subtle cause of GPU hangs recently is:
> 
> 
> commit 76c1dec1979d9b552aab9600eb898ccec394fbbc
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Sat Sep 25 11:22:51 2010 +0100
> 
>     drm/i915: Make the mutex_lock interruptible on ioctl paths
> 
>     ... and combine it with the wedged completion handler.
> 
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> 
> which requires libdrm.git until the debate settles as whether introducing more
> potential EINTR is an actual abi break.

Oh, our description was not correct: it is not a GPU hang. Those tests just stall waiting for ioctl(GEM_BUSY) to return. In the normal case, these tests return immediately.

Comment 4 Chris Wilson 2010-09-26 00:54:43 UTC

Weirder still: other than the acquisition of the lock, i915_gem_busy_ioctl() is meant to be a non-blocking operation. Can you grab a dmesg after 'echo t > /proc/sysrq-trigger' so I can see the contention? And if you have the opportunity, run with mutex debugging enabled in your kernel (just in case).

Comment 5 Shuang He 2010-09-26 01:23:15 UTC

Created attachment 38959
dmesg when stuck at ioctl(GEM_BUSY)

This is the dmesg captured while the test case was stuck at ioctl(GEM_BUSY), after "echo t > /proc/sysrq-trigger".

Comment 6 Shuang He 2010-09-26 01:26:46 UTC

One more finding: when the test case is stuck and we move the mouse pointer inside that window, it continues running for as long as we keep moving the pointer.

Comment 7 Chris Wilson 2010-09-26 01:47:39 UTC

So we have X throttling, glean waiting in busy, and missing interrupts. Hmm.

Not a fix, but does the problem disappear with tip of drm-intel-next? I'm thinking that:

commit f787a5f59e1b0e320a6b0a37e9a2e306551d1e40
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 24 16:02:42 2010 +0100

    drm/i915: Only hold a process-local lock whilst throttling.
    
    Avoid cause latencies in other clients by not taking the global struct
    mutex and moving the per-client request manipulation a local per-client
    mutex. For example, this allows a compositor to schedule a page-flip
    (through X) whilst an OpenGL application is monopolising the GPU.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

will hide the contention. And leave glean stuck all by itself...

Comment 8 wang,jinjin 2010-09-26 22:14:20 UTC

The problem still appears with commit f787a5f59e1b0e320a6b0a37e9a2e306551d1e40 and with the newest commit 1c25595f8d31392b8c36b54c624d01591dbfb87b on drm-intel-next. I just compiled the drm modules at each of those commits and loaded them with insmod.

Comment 9 Chris Wilson 2010-09-28 05:43:30 UTC

Still not sure what is causing the busy ioctl to hang, but dropping the lazy ring request was buggy in its own right:

commit a56ba56c275b1c2b982c8901ab92bf5a0fd0b757
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Sep 28 10:07:56 2010 +0100

    Revert "drm/i915: Drop ring->lazy_request"
    
    With multiple rings generating requests independently, the outstanding
    requests must also be track independently.
    
    Reported-by: Wang Jinjin <jinjin.wang@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=30380
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

But I haven't worked out the causal link between that and the hang, so please check.

Comment 10 Gordon Jin 2010-10-13 01:23:22 UTC

Jinjin, please retest, either by reverting the commit yourself or by testing the revert patch on drm-intel-staging.

Comment 11 wang,jinjin 2010-10-13 02:58:42 UTC

I found commit a56ba56c275b1c2b982c8901ab92bf5a0fd0b757 on both drm-intel-next and drm-intel-staging, so I tested with
Kernel: (drm-intel-next)2d7b8366ae4a9ec2183c30e432a4a9a495c82bcd. The problem still occurs as before.

Comment 12 Chris Wilson 2010-10-13 03:09:42 UTC

Jinjin, thanks for the confirmation.

Comment 13 Chris Wilson 2010-12-01 04:03:35 UTC

I believe this

commit de18a29e0fa3904894b4e02fae0e712cd43f740c
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Sat Nov 27 22:30:41 2010 +0100

    drm/i915: fix regression due to ba3d8d749b01548b9
    
    We don't track gpu flush request in any special way. So even with
    obj->write_domain == 0, a gpu flush might be outstanding but no
    yet executed. Even worse, the latest request might use the object
    only for reading. So and unconditional call to object_wait_rendering
    is needed for !pipelined.
    
    Hence revert that patch fully and untangle the flushing from the
    synchronization again.
    
    Reported-by: Keith Packard <keithp@keithp.com>
    Tested-by: Keith Packard <keithp@keithp.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

should have cleared up this bug.

Comment 14 zhao jian 2010-12-06 22:20:27 UTC

(In reply to comment #13)
> I believe this
> commit de18a29e0fa3904894b4e02fae0e712cd43f740c
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Sat Nov 27 22:30:41 2010 +0100
>     drm/i915: fix regression due to ba3d8d749b01548b9
>     We don't track gpu flush request in any special way. So even with
>     obj->write_domain == 0, a gpu flush might be outstanding but no
>     yet executed. Even worse, the latest request might use the object
>     only for reading. So and unconditional call to object_wait_rendering
>     is needed for !pipelined.
>     Hence revert that patch fully and untangle the flushing from the
>     synchronization again.
>     Reported-by: Keith Packard <keithp@keithp.com>
>     Tested-by: Keith Packard <keithp@keithp.com>
>     Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> should have cleared up this bug.

I find it still exists with the newest kernel on the drm-intel-next branch.
I tested with Kernel: (drm-intel-next)5aa7d52aebfc11760bbc5b081ed621227bb77981

Comment 15 Chris Wilson 2010-12-07 02:58:30 UTC

D'oh. Of course.

Fixed on drm-intel-staging, just waiting on an ack for another patch before committing to -fixes and merging into -next (since it conflicts badly).

commit c2edf2748b45d6a40d30b962fac8721f24b9af70
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Dec 7 10:38:40 2010 +0000

    drm/i915: Emit a request to clear an flushed and idle ring for busy bo
    
    In order for bos to retire eventually, a request must be sent down the
    ring. This is expected, for example, by occlusion queries for which mesa
    will wait upon (whilst running glean) before issuing more batches and so
    the normal activity upon the ring is suspended and we need to emit a
    request to clear the idle ring.
    
    Reported-by: Jinjin, Wang <jinjin.wang@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=30380
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
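
The commit message above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the i915 implementation: every `toy_*` name is hypothetical, a single ring is modelled, and seqno wraparound is ignored. A bo records the seqno of the last request that uses it, and it can only retire once the GPU has reported that seqno as complete. If the seqno has been reserved but no request carrying it was ever emitted, polling never sees the bo go idle unless the busy query emits the request itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy ring: all fields are illustrative, not the kernel's layout. */
struct toy_ring {
    uint32_t next_seqno;      /* next seqno to hand out */
    uint32_t lazy_seqno;      /* reserved but not yet emitted, 0 = none */
    uint32_t emitted_seqno;   /* newest seqno written to the ring */
    uint32_t completed_seqno; /* newest seqno the GPU has completed */
};

/* Toy buffer object. */
struct toy_bo {
    uint32_t last_rendering_seqno;
    bool active;
};

static void toy_ring_init(struct toy_ring *ring)
{
    ring->next_seqno = 1;
    ring->lazy_seqno = 0;
    ring->emitted_seqno = 0;
    ring->completed_seqno = 0;
}

/* Mark bo as used by pending work; only reserve a seqno, emit nothing. */
static void toy_use_bo(struct toy_ring *ring, struct toy_bo *bo)
{
    if (ring->lazy_seqno == 0)
        ring->lazy_seqno = ring->next_seqno++;
    bo->last_rendering_seqno = ring->lazy_seqno;
    bo->active = true;
}

/* Write the reserved seqno into the ring as a real request. */
static void toy_emit_request(struct toy_ring *ring)
{
    if (ring->lazy_seqno != 0) {
        ring->emitted_seqno = ring->lazy_seqno;
        ring->lazy_seqno = 0;
    }
}

/* The GPU can only complete what was actually emitted. */
static void toy_gpu_run(struct toy_ring *ring)
{
    ring->completed_seqno = ring->emitted_seqno;
}

/* Busy query; emit_if_needed selects the behaviour described above. */
static bool toy_busy(struct toy_ring *ring, struct toy_bo *bo,
                     bool emit_if_needed)
{
    if (emit_if_needed && bo->active &&
        ring->lazy_seqno == bo->last_rendering_seqno)
        toy_emit_request(ring);

    /* Retire the bo once its seqno has been completed. */
    if (bo->active && ring->completed_seqno >= bo->last_rendering_seqno)
        bo->active = false;

    return bo->active;
}
```

In this model, after `toy_use_bo()` a caller that alternates `toy_gpu_run()` and `toy_busy(..., false)` sees the bo stay busy indefinitely, which mirrors the stuck test. With `emit_if_needed` set to true, the first query emits the request, and the next query after `toy_gpu_run()` reports the bo as idle.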

Comment 16 zhao jian 2010-12-07 18:49:51 UTC

(In reply to comment #15)
> D'oh. Of course.
> Fixed on drm-intel-staging, just waiting on an ack for another patch before
> committing to -fixes and merging into -next (since it conflicts badly).
> commit c2edf2748b45d6a40d30b962fac8721f24b9af70
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Dec 7 10:38:40 2010 +0000
>     drm/i915: Emit a request to clear an flushed and idle ring for busy bo
>     In order for bos to retire eventually, a request must be sent down the
>     ring. This is expected, for example, by occlusion queries for which mesa
>     will wait upon (whilst running glean) before issuing more batches and so
>     the normal activity upon the ring is suspended and we need to emit a
>     request to clear the idle ring.
>     Reported-by: Jinjin, Wang <jinjin.wang@intel.com>
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=30380
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

It now works well with the newest code on the drm-intel-staging branch, which includes commit c2edf2748b45d6a40d30b962fac8721f24b9af70 plus one commit on top of it. Now waiting for it to be applied to the -fixes and -next branches.
Kernel: (drm-intel-staging)e1c7e8c08a30f39ccb5e473e58edf94adb07a853

Comment 17 zhao jian 2010-12-13 23:45:14 UTC

Now it works well on both the drm-intel-next and drm-intel-fixes branches.

Kernel:	(drm-intel-fixes) 63abf3edaf42d0b9f278df90fe41c7ed4796b6b1
Kernel:	(drm-intel-next) 8d5203ca62539c6ab36a5bc2402c2de1de460e30

Comment 18 Jari Tahvanainen 2017-09-04 10:02:20 UTC

Closing old verified+fixed.
