Bug 97305

Summary: Gallium: TBOs and images set the offset in elements, not bytes
Product: Mesa Reporter: Matias N. Goldberg <dark_sylinc>
Component: Drivers/Gallium/radeonsiAssignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact: Default DRI bug account <dri-devel>
Severity: critical    
Priority: medium CC: vedran
Version: git   
Hardware: All   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: Proof of the bug

Description Matias N. Goldberg 2016-08-11 19:53:09 UTC
I'm the main developer of Ogre 2.1

We're trying our demos in latest Mesa:
I've built Mesa from scratch. Tried commits git-edfc17a (current head of branch 12.0) and commit 17f1c49b9ad05af4f6482f6fa950e5dcc1a779d1 (current head of master branch)

Almost all of them work except for the compute shader based ones (Googling around it appears for Southern Island radeon it could be buggy so it was turned off. Not 100% sure but I think that's what's happening).
Anyway, not a big deal on that. I understand the Mesa team considers that WIP. Move on.

However our Forward3D demo was glitching and I didn't know why. I've investigated the problem and located it:
* glGetIntegerv w/ GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT is returning 4, should return 256 for my HW (GCN 1.0 AMD Radeon HD 7770 Southern Islands).
* glGetIntegerv w/ GL_TEXTURE_BUFFER_OFFSET_ALIGNMENT is returning 4, should return 256 for my HW.

In both cases "256" is what Windows drivers return btw, including fglrx on Linux.

When aligned to 4 bytes; the following worked correctly:
glTexBufferRange( GL_TEXTURE_BUFFER, GL_R32UI, boName, offsetAlignedTo4ButNotTo256, sizeBytes );
uniform usamplerBuffer f3dGrid;
uint numLightsInGrid = texelFetch( f3dGrid, int(sampleOffset) ).x;

But the following did not:
glTexBufferRange( GL_TEXTURE_BUFFER, GL_RGBA32F, boName, offsetAlignedTo4ButNotTo256, sizeBytes );
uniform samplerBuffer f3dLightList;
vec4 values = texelFetch( f3dLightList, int(sampleOffset) ).x;

The output of the latter was flickering garbage. When I forced 256 byte alignment, it worked as expected.

I consider this a critical issue because I suspect it's causing subtle problems with a lot of titles whose buffer offsets don't start at 0.

I believe NVIDIA GPUs also require buffer offsets of 256. I don't know about Intel cards.
Comment 1 Ilia Mirkin 2016-08-11 20:06:00 UTC
(In reply to Matias N. Goldberg from comment #0)
> But the following did not:
> glTexBufferRange( GL_TEXTURE_BUFFER, GL_RGBA32F, boName,
> offsetAlignedTo4ButNotTo256, sizeBytes );
> uniform samplerBuffer f3dLightList;
> vec4 values = texelFetch( f3dLightList, int(sampleOffset) ).x;
> 
> The output of the latter was flickering garbage. When I forced 256 byte
> alignment, it worked as expected.

My guess is that 16 would have worked as well. It just has to be aligned to the texel. 4 works for GL_RGBA8 but not GL_RGBA32*. Unfortunately there's no provision for a per-format limit.
Comment 2 Nicolai Hähnle 2016-08-11 20:17:39 UTC
I agree with Ilia. There should be no requirement for an alignment greater than the "texel" size, that really wouldn't make sense given how the hardware works. Can you please try with 16-but-not-256 alignment?
Comment 3 Matias N. Goldberg 2016-08-11 20:30:54 UTC
>> BEGIN PATCH

diff --git a/src/gallium/drivers/radeonsi/si_pipe.c b/src/gallium/drivers/radeonsi/si_pipe.c
index 88f4f20..6a2d5bc 100644
--- a/src/gallium/drivers/radeonsi/si_pipe.c
+++ b/src/gallium/drivers/radeonsi/si_pipe.c
@@ -363,10 +363,11 @@ static int si_get_param(struct pipe_screen* pscreen, enum pipe_cap param)
 
 	case PIPE_CAP_CONSTANT_BUFFER_OFFSET_ALIGNMENT:
 	case PIPE_CAP_TEXTURE_BUFFER_OFFSET_ALIGNMENT:
+		return 256;
 	case PIPE_CAP_MAX_TEXTURE_GATHER_COMPONENTS:
 		return 4;
 	case PIPE_CAP_SHADER_BUFFER_OFFSET_ALIGNMENT:
-		return HAVE_LLVM >= 0x0309 ? 4 : 0;
+		return HAVE_LLVM >= 0x0309 ? 256 : 0;
 
 	case PIPE_CAP_GLSL_FEATURE_LEVEL:
 		if (pscreen->get_shader_param(pscreen, PIPE_SHADER_COMPUTE,

<< END PATCH


I'm playing safe by going with 256 bytes with everyone, based on the overwhelming amount of reports (on the left top there's a dropdown, select "All"); unless someone wants to go through the trouble of identifying each card. Even in DX12 256 alignment is by spec:
http://delphigl.de/glcapsviewer/gl_stats_caps_single.php?listreportsbycap=GL_UNIFORM_BUFFER_OFFSET_ALIGNMENT
http://delphigl.de/glcapsviewer/gl_stats_caps_single.php?listreportsbycap=GL_TEXTURE_BUFFER_OFFSET_ALIGNMENT
http://delphigl.de/glcapsviewer/gl_stats_caps_single.php?listreportsbycap=GL_SHADER_STORAGE_BUFFER_OFFSET_ALIGNMENT

I believe this patch could potentially increase radeonsi's compatibility greatly and should be tested against popular game titles.
Comment 4 Matias N. Goldberg 2016-08-11 20:31:44 UTC
Sure I will try 16 but not 256
Comment 5 Marek Olšák 2016-08-11 22:12:15 UTC
Pre-GCN cards require an alignment of 256 bytes for uniform buffers. GCN cards only require 4 byte alignment there.

All other buffer binding points require 1, 2, or 4 bytes depending on the texel size. It's really min(texel_size, 4).

Whatever the closed OpenGL driver reports doesn't correspond to the reality.

The limits reported by our Vulkan driver are very similar to Mesa.
Comment 6 Matias N. Goldberg 2016-08-11 22:18:38 UTC
I'm back.

I only tested texture and uniform buffers, not SSBOs.

Results:
* Setting UBOs to 4 byte alignments didn't seem to matter in any way. Performance didn't seem to be affected either. Nonetheless given that AMD always sets it to 256; this is what I'd recommend (i.e. there could be undocumented hw errors for edge cases?)

* Setting TBOs to non-16 byte alignments breaks in a specific pattern (i.e. 4 byte alignments produce certain glitches, 8 byte alignments produce a different set of glitches, 12 byte alignments a different set, etc). Setting TBOs to 16 byte alignments fixes everything. I benchmarked 256-byte alignments vs 16-byte alignments-non-256-byte and couldn't see any performance difference (the sample kept around 46.50-47.00ms all the time).
Comment 7 Matias N. Goldberg 2016-08-11 22:22:15 UTC
Our code is complex so if you keep insisting 4 byte alignment is correct, I won't deny the possibility it could be a bug in our codebase. I would have to create a simple as possible case to double check it.
Comment 8 Marek Olšák 2016-08-11 22:28:29 UTC
> Nonetheless given that AMD always sets it to 256 ...

Not true. The Mesa driver is also developed by us. Nicolai and me work for AMD. There is also the Vulkan driver which returns "1 byte" for the texture buffer alignment.
Comment 9 Ilia Mirkin 2016-08-11 22:39:55 UTC
(In reply to Marek Olšák from comment #5)
> Pre-GCN cards require an alignment of 256 bytes for uniform buffers. GCN
> cards only require 4 byte alignment there.
> 
> All other buffer binding points require 1, 2, or 4 bytes depending on the
> texel size. It's really min(texel_size, 4).

That means that you have to return max(texel_size) for the GL_TEXTURE_BUFFER_OFFSET_ALIGNMENT. You only get one cap, and a texture buffer can be any offset.
Comment 10 Marek Olšák 2016-08-11 22:42:41 UTC
(In reply to Ilia Mirkin from comment #9)
> (In reply to Marek Olšák from comment #5)
> > Pre-GCN cards require an alignment of 256 bytes for uniform buffers. GCN
> > cards only require 4 byte alignment there.
> > 
> > All other buffer binding points require 1, 2, or 4 bytes depending on the
> > texel size. It's really min(texel_size, 4).
> 
> That means that you have to return max(texel_size) for the
> GL_TEXTURE_BUFFER_OFFSET_ALIGNMENT. You only get one cap, and a texture
> buffer can be any offset.

That's what we already do.
Comment 11 Ilia Mirkin 2016-08-11 22:50:41 UTC
(In reply to Marek Olšák from comment #10)
> (In reply to Ilia Mirkin from comment #9)
> > (In reply to Marek Olšák from comment #5)
> > > Pre-GCN cards require an alignment of 256 bytes for uniform buffers. GCN
> > > cards only require 4 byte alignment there.
> > > 
> > > All other buffer binding points require 1, 2, or 4 bytes depending on the
> > > texel size. It's really min(texel_size, 4).
> > 
> > That means that you have to return max(texel_size) for the
> > GL_TEXTURE_BUFFER_OFFSET_ALIGNMENT. You only get one cap, and a texture
> > buffer can be any offset.
> 
> That's what we already do.

You return 4. However that's not enough if I set the format to GL_RGBA32I. There's no provision in the spec to disallow someone feeding you a aligned-to-4 buffer with GL_RGBA32I format if you return "4" as the value.

Oh, unless I totally misunderstood what you were saying, and that situation should work. In which case, apologies for the noise.
Comment 12 Matias N. Goldberg 2016-08-11 22:54:59 UTC
Created attachment 125723 [details]
Proof of the bug
Comment 13 Matias N. Goldberg 2016-08-11 23:08:40 UTC
I've added a "Proof of the bug".

There IS a bug. I don't know if it's software or in the hardware. The attached sample does the following:

1. Create a 1MB buffer.
2. Fill the first 4kb with the pattern:
data[i+0] = 1.0f;
data[i+1] = 0.5f;
data[i+2] = 0.15f;
data[i+3] = 0.0f;
3. Attach it via:
glActiveTexture( GL_TEXTURE0 );
glBindTexture( GL_TEXTURE_BUFFER, texName );
glTexBufferRange( GL_TEXTURE_BUFFER, GL_RGBA32F, vboName,
		  elementStart*sizeof(float),
		  1024*4*sizeof(float) );
++elementStart;
4. Render a triangle showing the colours.

If the alignment is not a problem, I should see Orange-Cream, then Carmine, then blue-violetish, then green; repeat.

Instead I **ALWAYS** see Orange-Cream; which is clearly incorrect.
After running the sample long enough, elementStart will be > 1024 (thus 1024*4 exceeding 4kb) and begins to display black. (and if you leave it long enough running it could crash or throw an error, that's not the point)
Comment 14 Marek Olšák 2016-08-11 23:48:26 UTC
Thanks. It's not a hardware issue. It's a driver design issue.
Comment 15 Marek Olšák 2016-08-12 00:01:24 UTC
I'm working on it.
Comment 16 Vedran Miletić 2016-08-29 15:05:28 UTC
(In reply to Marek Olšák from comment #15)
> I'm working on it.

Any news?
Comment 17 Marek Olšák 2016-08-29 15:15:32 UTC
It should be fixed in master already.
Comment 18 Matias N. Goldberg 2016-08-29 17:42:22 UTC
You're awesome!

I can confirm it works on both the sample repro I provided and my actual real case.

Small note: We discovered there was a potential for a crash because at certain points we assume SIMD alignment; so we're now enforcing 16-byte alignment as minimum in our code. I'm doing this remark because Mesa is the only (or among the very few) implementations that return alignments <16; which can reveal hidden bugs in program code that runs fine everywhere else. If you ever see a bug report of these sorts (e.g. crashing in movaps, movntdqa), this is something to keep in mind.

I appreciate the 4 byte alignment as 256 reported by others is excessively wasteful. Thanks.

For the record it was fixed in
https://cgit.freedesktop.org/mesa/mesa/commit/?id=7cd256ce7e4bad680bb77d033cf5dd662abab2dd
and
https://cgit.freedesktop.org/mesa/mesa/commit/?id=325379096f54dde39171d1b8804e29a8003bb3c7

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.