Bug 23688

Summary:	i965 memory corruption with FBO rendering
Product:	Mesa	Reporter:	Brian Paul <brian.paul>
Component:	Drivers/DRI/i965	Assignee:	Eric Anholt <eric>
Status:	VERIFIED FIXED	QA Contact:
Severity:	critical
Priority:	medium	CC:	shuang.he
Version:	git
Hardware:	x86-64 (AMD64)
OS:	All
Whiteboard:
i915 platform:		i915 features:
Attachments:	fbo memory corruption test prog; GPU dump when fbo.c hang GPU

Description Brian Paul 2009-09-03 08:08:26 UTC

Created attachment 29177
fbo memory corruption test prog

I've found some kind of GEM memory manager bug that causes rendering to an FBO to write to (random?) regions of memory.  This causes either screen corruption or a GPU lock-up.

The attached test program exercises the bug.  The main loop is basically:

while (1) {
   Create a small texture and an FBO that renders into it.
   Clear the FBO to red or green on alternate frames.
   Render a purple cube into the FBO.
   Blit the FBO image to the window.
   Destroy the texture/FBO.
   SwapBuffers()
}

Compile, run and press 's'.  At some point, the purple cube rendering stops; you'll just see blank red or green background rects.  Then it may recover, or the screen corruption will appear, or the GPU will lock up.  When the screen gets corrupted, the bad pixels are purple, which tells me that the cube's rendering is not going to the FBO.


I enabled intel_bufmgr's debug output and noticed that just before the problem occurs, there's a very long stream of:

bo_unreference final: 251 (SS_SURF_BIND)
bo_unreference final: 254 (SS_SURF_BIND)
bo_unreference final: 1331 (SS_SURF_BIND)
bo_unreference final: 308 (SS_SURF_BIND)
bo_unreference final: 158 (SS_SURF_BIND)
bo_unreference final: 1043 (SS_SURF_BIND)
[about 1400 of them]


Also, in brw_update_renderbuffer_surface(), I noticed that the FBO region's bo's "offset" switches from always being zero to something like 161230848.

Comment 1 Shuang He 2009-09-03 22:19:46 UTC

We also hit this issue with OGLC/fbo.c.

Comment 2 Shuang He 2009-09-03 22:22:34 UTC

Created attachment 29199
GPU dump when fbo.c hang GPU

Comment 3 Eric Anholt 2009-09-06 17:32:54 UTC

The surface state held in that hunk of state cache keeps refs on its buffers, so flushing the state cache is the first point at which the render target BOs get into the bo reuse cache and are then handed back out for a new allocation.  That's why this is the first time you're reusing a surface buffer and seeing an object with a previous offset in it.
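
To make the reuse behaviour concrete, here is a minimal toy model in C. It is only a sketch: struct toy_bo and the toy_* functions are hypothetical names invented for illustration, not the libdrm or Mesa implementation. It shows a buffer that starts at offset 0, gets bound somewhere, is parked in a reuse cache on its final unreference, and then comes back from the next allocation still carrying its old offset.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a GEM buffer object; NOT the libdrm struct. */
struct toy_bo {
    unsigned long offset;   /* last GPU address recorded for this buffer */
    size_t size;
    int in_cache;           /* 1 while parked in the reuse cache */
};

#define TOY_POOL_SIZE 4
struct toy_bo toy_pool[TOY_POOL_SIZE];
int toy_pool_used;

/* Return a cached buffer of the same size if one exists, else a fresh one. */
struct toy_bo *toy_bo_alloc(size_t size)
{
    for (int i = 0; i < toy_pool_used; i++) {
        if (toy_pool[i].in_cache && toy_pool[i].size == size) {
            toy_pool[i].in_cache = 0;   /* offset is left untouched */
            return &toy_pool[i];
        }
    }
    if (toy_pool_used == TOY_POOL_SIZE)
        return NULL;                    /* toy pool exhausted */

    struct toy_bo *bo = &toy_pool[toy_pool_used++];
    bo->offset = 0;                     /* never bound, so no address yet */
    bo->size = size;
    bo->in_cache = 0;
    return bo;
}

/* Pretend the kernel placed the buffer at gpu_addr during execution. */
void toy_bo_bind(struct toy_bo *bo, unsigned long gpu_addr)
{
    assert(bo != NULL);
    bo->offset = gpu_addr;
}

/* Final unreference: park the buffer in the cache instead of freeing it. */
void toy_bo_unreference_final(struct toy_bo *bo)
{
    assert(bo != NULL);
    bo->in_cache = 1;
}
```

In this model, allocating a buffer, binding it at 161230848, dropping the last reference, and allocating again with the same size returns the same object with offset 161230848 rather than 0, which matches the switch in "offset" reported in the description.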

This bug does appear to go away with bo_reuse=false, and I've been poking at figuring out how our caching could be going wrong.  texture_tiling=false doesn't help, always_flush_batch=true always_flush_cache=true doesn't help.

Comment 4 Eric Anholt 2009-09-07 11:26:26 UTC

Removing the set_tiling from bo_reuse in libdrm fixes it, at the cost of performance.

Comment 5 Eric Anholt 2009-09-09 13:02:22 UTC

commit 5604b27b9326ac542069a49ed9650c4b0d3e939a
Author: Eric Anholt <eric@anholt.net>
Date:   Wed Sep 9 12:35:30 2009 -0700

    i965: Fix relocation delta for WM surfaces.
    
    This was a regression in 0f328c90dbc893e15005f2ab441d309c1c176245.
    
    Bug #23688
    Bug #23254

Comment 6 Brian Paul 2009-09-14 11:49:51 UTC

Confirmed that this fixes the test prog and other related failures.  Thanks!
