13918 – [965 TTM] glean case pointSprite segment fault

Status:	VERIFIED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	git
Hardware:	Other All

Importance:	medium normal
Assignee:	Eric Anholt
Reported:	2008-01-03 21:17 UTC by WuNian
Modified:	2009-08-24 12:29 UTC
CC List:	1 user

Description WuNian 2008-01-03 21:17:15 UTC

When running the glean test case pointSprite with the i965 driver (I ran it on a Q965), it segfaults. The backtrace is below:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4694c159 in raise () from /lib/libc.so.6
#2  0x4694d6e3 in abort () from /lib/libc.so.6
#3  0xb7afd326 in dri_ttm_fence_wait (fence=0x83fab50)
    at intel_bufmgr_ttm.c:625
#4  0xb7af6546 in dri_fence_wait (fence=0x83fab50) at ../common/dri_bufmgr.c:93
#5  0xb7afe6bf in intelFinish (ctx=0x810c6a0) at intel_context.c:313
#6  0xb7b0979c in intelSpanRenderStart (ctx=0x810c6a0) at intel_span.c:316
#7  0xb7c21fbc in _swrast_ReadPixels (ctx=0x810c6a0, x=0, y=0, width=40,
    height=40, format=6407, type=5126, packing=0x8118314, pixels=0x83f2af8)
    at swrast/s_readpix.c:562
#8  0xb7c7a624 in _mesa_ReadPixels (x=0, y=0, width=40, height=40,
    format=6407, type=5126, pixels=0x83f2af8) at main/drawpix.c:304
#9  0xb7f11eb3 in glReadPixels (x=0, y=0, width=40, height=40, format=6407,
    type=5126, pixels=0x83f2af8) at ../../../src/mesa/glapi/glapitemp.h:1365
#10 0x080848e8 in GLEAN::PointSpriteTest::runOne (this=0x80ed060,
    r=@0x8135818, w=@0xbfd35c38) at tpointsprite.cpp:398
#11 0x0805a81f in GLEAN::BaseTest<GLEAN::MultiTestResult>::run (
    this=0x80ed060, environment=@0xbfd35cc0) at tbase.h:290
#12 0x0805443b in main (argc=5, argv=0xbfd35e04) at main.cpp:128

With INTEL_NO_TTM=1 exported, the case also aborts, with this message:
intelWaitIrq: drmI830IrqWait: -16
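
For readers decoding the -16 above: it reads as a negated errno value, and 16 is EBUSY on Linux. The helper below is only a sketch for turning such a return code into text; `errno_text` is a hypothetical name and not part of the driver or libdrm.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper: map a negated errno-style return code to text.
 * Assumes the code follows the usual "negative errno" convention. */
const char *errno_text(int ret)
{
    assert(ret != 0);                      /* 0 means success */
    return strerror(ret < 0 ? -ret : ret); /* strerror wants the positive value */
}
```

Calling `errno_text(-16)` on Linux returns the C library's description of EBUSY.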

Comment 1 Eric Anholt 2008-02-15 11:08:31 UTC

So it must have failed with an error about the fence wait printed out. Did the X server die soon after (did other rendering fail due to a chip hang)? I can't reproduce this on my GM965 system.

Comment 2 Thomas Hellström 2008-02-15 12:13:02 UTC

Actually it's entirely possible to fill the ring with rendering commands that will take more than 3 seconds to complete. I see that problem with "gltestperf" and i915 ttm. 

I'm not saying that's the problem here but it may be a candidate.

It's hard to come up with a good way to solve this problem, but ideally we shouldn't rely on fences to detect timeouts.

/Thomas

Comment 3 Thomas Hellström 2008-02-15 14:07:02 UTC

(In reply to comment #2)

> It's hard to come up with a good way to solve this problem, but ideally we
> shouldn't rely on fences to detect timeouts.
> 
err... detect lockups not timeouts.

/Thomas

Comment 4 Eric Anholt 2008-02-15 14:21:12 UTC

Yeah, this has always been a problem, but things seem to have become more fragile since TTM.  We should fix the fence wait code to do basically what we used to with irq waits from userland -- if the ring's made any progress, don't bail out yet.

We should also probably reduce ring size once we've got everything going through batchbuffers, so that you don't end up with such huge queues to chew through (this is a problem for X where x11perf or a similar dump-rendering-without-waiting app destroys interactivity).  Doing it today before we've moved 2D to batchbuffers may kill our already lackluster 2D performance.
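
The "don't bail out while the ring is making progress" idea can be sketched roughly as follows. This is not the driver's code: `ring_probe`, `wait_with_progress` and the toy `sim_ring` device are hypothetical names, and polling stands in for whatever sleep or IRQ wait a real implementation would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical view of the hardware; these names are not from the driver. */
struct ring_probe {
    uint32_t (*read_head)(void *dev);  /* current ring head offset */
    bool (*fence_signaled)(void *dev); /* has the awaited fence passed? */
    void *dev;
};

/* Returns 0 once the fence signals, or -1 if the ring head did not move
 * for max_idle_polls consecutive polls (a suspected hang). */
int wait_with_progress(const struct ring_probe *p, unsigned max_idle_polls)
{
    assert(p && p->read_head && p->fence_signaled);
    uint32_t last_head = p->read_head(p->dev);
    unsigned idle = 0;

    while (!p->fence_signaled(p->dev)) {
        uint32_t head = p->read_head(p->dev);
        if (head != last_head) {
            last_head = head;   /* progress: restart the idle count */
            idle = 0;
        } else if (++idle >= max_idle_polls) {
            return -1;          /* no progress: give up */
        }
        /* A real driver would sleep or wait on an IRQ here. */
    }
    return 0;
}

/* Toy device so the sketch can be exercised without hardware. */
struct sim_ring {
    uint32_t head;     /* advances by one per read until stall_at */
    uint32_t stall_at; /* head value at which the simulated chip hangs */
    uint32_t fence_at; /* head value at which the fence signals */
};

static uint32_t sim_read_head(void *dev)
{
    struct sim_ring *r = dev;
    if (r->head < r->stall_at)
        r->head++;
    return r->head;
}

static bool sim_fence_signaled(void *dev)
{
    const struct sim_ring *r = dev;
    return r->head >= r->fence_at;
}
```

With this shape, a long but healthy queue never times out, while a ring whose head stops moving is reported after a bounded number of polls.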

Comment 5 Thomas Hellström 2008-02-16 01:46:29 UTC

(In reply to comment #4)
> Yeah, this has always been a problem, but things seem to have become more
> fragile since TTM.  We should fix the fence wait code to do basically what we
> used to with irq waits from userland -- if the ring's made any progress, don't
> bail out yet.
> 

What we're doing elsewhere is to have a lockup watchdog that checks on rendering progress (once every 1/2 second or so, so it's not resource-consuming). When it detects what it thinks is a lockup, it goes on to check in more detail what's happening, and if it is indeed a lockup, it sets the error code on the fence, which signals it and releases waiting clients. Clients that check for errors will see an error. Then the GPU is reset. The net effect is that the 3D client aborts on the fence error, but the X server can continue rendering as if nothing happened (provided the GPU survives the reset).

This means that fence timeouts can be (and are) upped to 20 seconds or so.
Still, with programmable GPUs it's entirely possible to hit that limit as well, so we should probably, as you say, do something sane about this, and also the userland timeout problem, caused by waits that are interrupted by a signal.
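
A minimal sketch of the watchdog flow described above, under stated assumptions: the structure names, the use of a sequence number as the progress indicator, the `-EIO` error code, and the `resets` counter standing in for an actual GPU reset are all hypothetical, not taken from any driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

enum { FENCE_PENDING = 0, FENCE_SIGNALED = 1 };

struct fence {
    int state;
    int error;              /* 0, or a negative errno-style code */
};

struct watchdog {
    uint32_t last_seqno;    /* progress counter seen on the previous tick */
    unsigned stalled_ticks; /* consecutive ticks without progress */
    unsigned confirm_ticks; /* stalled ticks needed to declare a lockup */
    unsigned resets;        /* stand-in for performing a GPU reset */
};

/* Call periodically (the comment suggests roughly every half second).
 * Returns true when a lockup was declared on this tick. */
bool watchdog_tick(struct watchdog *wd, uint32_t hw_seqno,
                   struct fence *oldest_pending)
{
    assert(wd);
    if (!oldest_pending || oldest_pending->state == FENCE_SIGNALED) {
        wd->stalled_ticks = 0;      /* nothing outstanding */
        wd->last_seqno = hw_seqno;
        return false;
    }
    if (hw_seqno != wd->last_seqno) {
        wd->last_seqno = hw_seqno;  /* rendering is progressing */
        wd->stalled_ticks = 0;
        return false;
    }
    if (++wd->stalled_ticks < wd->confirm_ticks)
        return false;               /* suspicious, keep watching */

    /* Lockup: error the fence so waiters wake up, then "reset". */
    oldest_pending->error = -EIO;   /* assumed error code */
    oldest_pending->state = FENCE_SIGNALED;
    wd->resets++;
    wd->stalled_ticks = 0;
    return true;
}
```

A client waiting on the fence would wake up, see the nonzero `error`, and abort, while other clients continue after the reset.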

Possibly the best solution is to have a timeout delay on each fence instead of on the waiting itself. This avoids the need for a watchdog; a lockup check is instead triggered when a fence that has passed its timeout delay is waited for.

/Thomas
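
The per-fence timeout proposal could look something like the sketch below. All names (`timed_fence`, `fence_emit`, `fence_poll`) are hypothetical, and time is passed in as plain milliseconds rather than read from a clock. Because the deadline is fixed when the fence is emitted, a wait that is interrupted by a signal and restarted does not get a fresh timeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct timed_fence {
    bool signaled;
    uint64_t deadline_ms;   /* absolute time, fixed at emit */
};

enum wait_result {
    WAIT_DONE,          /* fence has signaled */
    WAIT_AGAIN,         /* not signaled, deadline not reached */
    WAIT_CHECK_LOCKUP   /* deadline passed: run the lockup check */
};

/* Record the deadline once, when the fence is emitted. */
void fence_emit(struct timed_fence *f, uint64_t now_ms, uint64_t timeout_ms)
{
    assert(f);
    f->signaled = false;
    f->deadline_ms = now_ms + timeout_ms;
}

/* One step of a wait; restarting the wait does not move the deadline. */
enum wait_result fence_poll(const struct timed_fence *f, uint64_t now_ms)
{
    assert(f);
    if (f->signaled)
        return WAIT_DONE;
    if (now_ms >= f->deadline_ms)
        return WAIT_CHECK_LOCKUP;
    return WAIT_AGAIN;
}
```

A waiter that receives `WAIT_CHECK_LOCKUP` would then run the detailed lockup check itself, which is what removes the need for a separate watchdog in this variant.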

Comment 6 WuNian 2008-02-17 19:04:02 UTC

The pointSprite case now passes on my machine too, so I am closing this bug.
For the fence problem, please open a new bug if you run into it.

Comment 7 WuNian 2008-02-17 19:12:09 UTC

When running gltestperf, it aborted with the error:
intel_bufmgr_ttm.c:626: Error -16 waiting for fence fence buffers: Device or resource busy.
Is there any bug related to this? Otherwise, we can open a new one to track it.

Comment 8 Thomas Hellström 2008-02-18 02:22:26 UTC

(In reply to comment #7)
> when run gltestperf, it aborted with error:
> intel_bufmgr_ttm.c:626: Error -16 waiting for fence fence buffers: Device or
> resource busy. Is there any bug related to this? otherwise, we can open a new
> one to track it.
> 

No bug that I know of.
I'll open one.

Comment 9 Adam Jackson 2009-08-24 12:29:05 UTC

Mass version move, cvs -> git
