When running the glean case pointSprite with the i965 driver (I ran it on Q965), it segfaults. The backtrace is below:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x4694c159 in raise () from /lib/libc.so.6
#2  0x4694d6e3 in abort () from /lib/libc.so.6
#3  0xb7afd326 in dri_ttm_fence_wait (fence=0x83fab50) at intel_bufmgr_ttm.c:625
#4  0xb7af6546 in dri_fence_wait (fence=0x83fab50) at ../common/dri_bufmgr.c:93
#5  0xb7afe6bf in intelFinish (ctx=0x810c6a0) at intel_context.c:313
#6  0xb7b0979c in intelSpanRenderStart (ctx=0x810c6a0) at intel_span.c:316
#7  0xb7c21fbc in _swrast_ReadPixels (ctx=0x810c6a0, x=0, y=0, width=40, height=40, format=6407, type=5126, packing=0x8118314, pixels=0x83f2af8) at swrast/s_readpix.c:562
#8  0xb7c7a624 in _mesa_ReadPixels (x=0, y=0, width=40, height=40, format=6407, type=5126, pixels=0x83f2af8) at main/drawpix.c:304
#9  0xb7f11eb3 in glReadPixels (x=0, y=0, width=40, height=40, format=6407, type=5126, pixels=0x83f2af8) at ../../../src/mesa/glapi/glapitemp.h:1365
#10 0x080848e8 in GLEAN::PointSpriteTest::runOne (this=0x80ed060, r=@0x8135818, w=@0xbfd35c38) at tpointsprite.cpp:398
#11 0x0805a81f in GLEAN::BaseTest<GLEAN::MultiTestResult>::run (this=0x80ed060, environment=@0xbfd35cc0) at tbase.h:290
#12 0x0805443b in main (argc=5, argv=0xbfd35e04) at main.cpp:128

With INTEL_NO_TTM=1 exported, the case also aborts, with this message:

intelWaitIrq: drmI830IrqWait: -16
So it must have failed with an error about the fence wait printed out. Did the X server die soon after (did other rendering fail due to a chip hang)? I can't reproduce this on my GM965 system.
Actually it's entirely possible to fill the ring with rendering commands that will take more than 3 seconds to complete. I see that problem with "gltestperf" and i915 ttm. I'm not saying that's the problem here but it may be a candidate. It's hard to come up with a good way to solve this problem, but ideally we shouldn't rely on fences to detect timeouts. /Thomas
(In reply to comment #2)
> It's hard to come up with a good way to solve this problem, but ideally we
> shouldn't rely on fences to detect timeouts.
>

err... detect lockups not timeouts.

/Thomas
Yeah, this has always been a problem, but things seem to have become more fragile since TTM. We should fix the fence wait code to do basically what we used to with irq waits from userland -- if the ring's made any progress, don't bail out yet. We should also probably reduce ring size once we've got everything going through batchbuffers, so that you don't end up with such huge queues to chew through (this is a problem for X where x11perf or a similar dump-rendering-without-waiting app destroys interactivity). Doing it today before we've moved 2D to batchbuffers may kill our already lackluster 2D performance.
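The "if the ring's made any progress, don't bail out yet" idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual driver or DRM code: `struct sim_ring`, `read_ring_head`, `fence_signaled`, and `wait_fence_with_progress` are hypothetical names, the ring is simulated in user space, and stalls are counted in polls rather than measured in wall-clock time.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware ring; not a real i965/DRM type. */
struct sim_ring {
    uint32_t head;    /* position the GPU has consumed up to */
    uint32_t target;  /* position at which the fence signals */
    uint32_t step;    /* simulated progress per poll (0 = hung GPU) */
};

/* Simulated register read: the GPU advances by `step` each time we look. */
static uint32_t read_ring_head(struct sim_ring *r)
{
    if (r->head < r->target) {
        r->head += r->step;
        if (r->head > r->target)
            r->head = r->target;
    }
    return r->head;
}

static bool fence_signaled(const struct sim_ring *r)
{
    return r->head >= r->target;
}

/*
 * Wait for the fence, but only give up when the ring head has not moved
 * for max_stalled_polls consecutive polls. A long queue that is still
 * draining never trips the timeout, however long it takes overall.
 * Returns 0 on success or -EBUSY when the ring appears stuck.
 */
int wait_fence_with_progress(struct sim_ring *r, int max_stalled_polls)
{
    uint32_t last_head = r->head;
    int stalled = 0;

    assert(max_stalled_polls > 0);

    while (!fence_signaled(r)) {
        uint32_t head = read_ring_head(r);

        if (head != last_head) {
            last_head = head;   /* progress was made: restart the stall count */
            stalled = 0;
        } else if (++stalled >= max_stalled_polls) {
            return -EBUSY;      /* no progress for too long: assume a hang */
        }
    }
    return 0;
}
```

In this sketch a slow but advancing ring returns 0 no matter how many polls it needs, while a ring whose head never moves returns -EBUSY after `max_stalled_polls` polls. Real code would sleep between polls and measure the stall in time.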
(In reply to comment #4)
> Yeah, this has always been a problem, but things seem to have become more
> fragile since TTM. We should fix the fence wait code to do basically what we
> used to with irq waits from userland -- if the ring's made any progress, don't
> bail out yet.
>

What we're doing elsewhere is to have a lockup watchdog that checks on rendering progress (once every 1/2 second or so, so it's not resource-consuming). When it detects what it thinks is a lockup, it goes on to check in more detail what's happening, and if it is indeed a lockup, it sets the error code on the fence, which signals it and releases waiting clients. Clients that check for errors will see an error. Then the GPU is reset. The net effect is that the 3D client aborts on the fence error, but the X server can continue rendering as if nothing happened (provided the GPU survives the reset). This means that fence timeouts can be (and are) upped to 20 seconds or so.

Still, with programmable GPUs it's entirely possible to hit that limit as well, so we should probably, as you say, do something sane about this, and also about the userland timeout problem caused by waits that are interrupted by a signal.

Possibly the best solution is to have a timeout delay on each fence instead of on the waiting itself. This avoids the need for a watchdog; a lockup check is instead triggered when a fence that has passed its timeout delay is waited for.

/Thomas
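A rough sketch of the per-fence timeout idea, under stated assumptions: the fence carries its own deadline, the detailed lockup check runs only when a waiter finds the fence past that deadline, and a detected lockup sets an error on the fence and signals it so waiters are released. All names here (`struct sim_fence`, `struct sim_gpu`, `gpu_locked_up`, `fence_poll`) are hypothetical, the "no sequence number retired since the last check" test stands in for whatever detailed check a real driver would do, and extending the deadline while the GPU is still progressing is an assumption of this sketch rather than something stated in the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the TTM fence objects. */
struct sim_fence {
    uint64_t deadline_ms;  /* emit time plus the per-fence timeout delay */
    uint32_t seqno;        /* sequence number that signals this fence */
    bool signaled;
    int error;             /* 0, or a negative errno set on lockup */
};

struct sim_gpu {
    uint32_t completed_seqno;      /* last sequence number retired */
    uint32_t seqno_at_last_check;  /* snapshot from the previous lockup check */
};

/* Assumed lockup test: nothing retired since the previous check. */
static bool gpu_locked_up(struct sim_gpu *gpu)
{
    bool stuck = gpu->completed_seqno == gpu->seqno_at_last_check;

    gpu->seqno_at_last_check = gpu->completed_seqno;
    return stuck;
}

/*
 * One polling step of a waiter at time now_ms (sequence-number wraparound
 * is ignored for brevity). Returns the fence's error code (0 on success)
 * once it is signaled, or -EAGAIN if the caller should keep waiting.
 */
int fence_poll(struct sim_fence *f, struct sim_gpu *gpu,
               uint64_t now_ms, uint64_t extend_ms)
{
    assert(extend_ms > 0);

    if (!f->signaled && gpu->completed_seqno >= f->seqno)
        f->signaled = true;
    if (f->signaled)
        return f->error;

    if (now_ms < f->deadline_ms)
        return -EAGAIN;          /* not overdue: no lockup check needed */

    if (gpu_locked_up(gpu)) {
        f->error = -EIO;         /* error is visible to every waiter */
        f->signaled = true;      /* signaling releases waiting clients */
        return f->error;         /* a real driver would reset the GPU here */
    }

    f->deadline_ms = now_ms + extend_ms;  /* still progressing: wait longer */
    return -EAGAIN;
}
```

Because the deadline lives on the fence, a wait that is interrupted by a signal and restarted does not get a fresh timeout, and no periodic watchdog is needed while nothing is overdue.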
The pointSprite case now passes on my machine as well, so I'm closing this bug. For the fence problem, please open a new bug if you hit it.
When running gltestperf, it aborted with the error:

intel_bufmgr_ttm.c:626: Error -16 waiting for fence fence buffers: Device or resource busy.

Is there an existing bug for this? Otherwise, we can open a new one to track it.
(In reply to comment #7)
> when run gltestperf, it aborted with error:
> intel_bufmgr_ttm.c:626: Error -16 waiting for fence fence buffers: Device or
> resource busy. Is there any bug related to this? otherwise, we can open a new
> one to track it.
>

No bug that I know of. I'll open one.
Mass version move, cvs -> git