25806 – NV40 vertex corruption (kernel BO deletion too early?)

Status:	RESOLVED INVALID

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/nouveau
Version:	git
Hardware:	Other All

Importance:	medium normal
Assignee:	Nouveau Project

Reported:	2009-12-27 13:45 UTC by Luca Barbieri
Modified:	2009-12-27 14:25 UTC
CC List:	0 users

Description Luca Barbieri 2009-12-27 13:45:10 UTC

On my G71 system, several programs show vertex corruption: vertices are corrupted or randomly go to infinity, producing spiked triangles or random polygons. Affected programs include demos/engine, demos/dinoshade, Blender, and Extreme Tux Racer.

The system is running:
Linux 2.6.33-rc2
libdrm 2.4.17
Mesa HEAD (b46bcd8e7b37aa2e9159e126c1cc88234a3c2790)
Detected an NV40 generation card (0x049800a2)
64 MB GART aperture
256 MB VRAM

The problem is solved by any one of the following:
1. #define FORCE_SWTNL 1
2. Adding usleep(10000) at the end of nv40_draw_arrays
3. Making nouveau_screen_bo_del do nothing

It seems that Mesa deletes a buffer object used for vertex data while the GPU is still reading from it. The kernel then performs the deletion without waiting for the GPU to finish, the memory (or GART mapping) is reused, and corruption ensues.

From Gallium tracing, Mesa is sending vertex data in 64 KB buffers, which are created, written, drawn and then recreated upon reuse (which seems correct behavior).

In other words, the kernel does not seem to keep an extra reference to buffers that are referenced by an in-flight pushbuffer and drop that reference only once the GPU has finished drawing.
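
To make the expected behavior concrete, here is a minimal toy model in C of the "extra reference while in flight" idea. It is not nouveau or TTM code; every name in it (toy_bo, toy_submit, toy_gpu_done, ...) is hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these are NOT kernel or libdrm structures. */
struct toy_bo {
    int refcount;   /* userspace handle plus in-flight submissions */
    bool freed;     /* true once the backing memory may be reused */
};

static void toy_bo_init(struct toy_bo *bo)
{
    bo->refcount = 1;   /* the reference held by userspace (Mesa) */
    bo->freed = false;
}

/* Drop one reference; the memory is released only at zero. */
static void toy_bo_unref(struct toy_bo *bo)
{
    assert(bo->refcount > 0);
    if (--bo->refcount == 0)
        bo->freed = true;
}

/* Submitting a pushbuffer that uses the buffer takes an extra reference. */
static void toy_submit(struct toy_bo *bo)
{
    assert(!bo->freed);
    bo->refcount++;
}

/* Called once the GPU has finished the submission. */
static void toy_gpu_done(struct toy_bo *bo)
{
    toy_bo_unref(bo);
}
```

In this model, calling toy_bo_unref() right after toy_submit() (what Mesa does when it recreates the vertex buffer) leaves the buffer alive until toy_gpu_done() runs. The corruption described above is what one would expect if that extra reference were missing or dropped too early.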

Is the kernel already supposed to do so?
If yes, something is broken. If things work for others, maybe my system is somehow more prone to reusing memory or GART mappings, so they don't see that?

If no, then how are things supposed to work?

(BTW, not freeing buffers leads to X freezing and the kernel oopsing on my machine once memory is saturated, but that's another issue.)

Comment 1 Luca Barbieri 2009-12-27 14:13:14 UTC

Upon further examination, the kernel does seem to have the required logic: sending a pushbuffer creates a fence, which is put in bo->sync_obj, which is then checked on deletion and if non-null, the buffer is put on a delayed destroy list.
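
As a rough sketch of that logic (a toy model in C, not the actual kernel code; apart from the sync_obj field name, which is taken from the description above, every identifier is hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: NOT the kernel's fence or buffer object types. */
struct toy_fence {
    bool signaled;               /* set when the GPU passes the fence */
};

struct toy_buf {
    struct toy_fence *sync_obj;  /* fence of the last submission, or NULL */
    bool destroyed;
    struct toy_buf *next;        /* link in the delayed destroy list */
};

static struct toy_buf *delayed_list = NULL;

static void toy_buf_init(struct toy_buf *b)
{
    b->sync_obj = NULL;
    b->destroyed = false;
    b->next = NULL;
}

/* Sending a pushbuffer attaches its fence to each buffer it uses. */
static void toy_buf_fence(struct toy_buf *b, struct toy_fence *f)
{
    b->sync_obj = f;
}

/* Deletion: a fenced buffer is queued instead of being freed at once. */
static void toy_buf_delete(struct toy_buf *b)
{
    assert(!b->destroyed);
    if (b->sync_obj != NULL) {
        b->next = delayed_list;
        delayed_list = b;
        return;
    }
    b->destroyed = true;
}

/* Reaper: destroy queued buffers whose fence has signaled.
 * Returns the number of buffers destroyed in this pass. */
static int toy_reap(void)
{
    int n = 0;
    struct toy_buf **pp = &delayed_list;

    while (*pp != NULL) {
        struct toy_buf *b = *pp;
        if (b->sync_obj->signaled) {
            *pp = b->next;       /* unlink from the list */
            b->next = NULL;
            b->sync_obj = NULL;
            b->destroyed = true;
            n++;
        } else {
            pp = &b->next;
        }
    }
    return n;
}
```

The model is only as safe as the signaled flag: if a fence were reported as signaled while the GPU was still reading the buffer, toy_reap() would release it too early.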

However, it seems to be somehow not working.

Maybe fencing is broken on my card? (i.e. the kernel thinks fences are signaled when they aren't)
Or possibly fences are being signaled before the vertex shader is finished running?

How can I test that fencing is working correctly?

Comment 2 Francisco Jerez 2009-12-27 14:25:30 UTC

(In reply to comment #1)
> Upon further examination, the kernel does seem to have the required logic:
> sending a pushbuffer creates a fence, which is put in bo->sync_obj, which is
> then checked on deletion and if non-null, the buffer is put on a delayed
> destroy list.
> 
> However, it seems to be somehow not working.
> 
> Maybe fencing is broken on my card? (i.e. the kernel thinks fences are signaled
> when they aren't)
> Or possibly fences are being signaled before the vertex shader is finished
> running?

That would be almost unprecedented... it's more likely that some caches in the GPU aren't being flushed often enough (or maybe the ones in the CPU... a bug in the kernel PAT code also used to cause the same symptoms, but that's hopefully already fixed).

I'm marking this as invalid because that's the current policy; unfortunately, we're already aware of too many Gallium bugs.

> 
> How can I test that fencing is working correctly?
>
