Created attachment 129135
apitrace of my program
While working on a course project for my computer graphics class, I found that my program didn't fully work on nouveau, while it worked fine with the Intel driver and the NVIDIA proprietary driver. The program uses deferred shading with some shadow map techniques, and it has an option to draw a "histogram" of the depth distribution sampled by a compute shader. On nouveau the histogram showed wrong results (completely blank).
After looking into the problem for a while, I happened to find that if I didn't render the histogram (i.e. didn't use the compute shader to sample) on the first frame, the histogram would go wrong. Otherwise, it would be okay.
This problem can be reproduced with the attached apitrace file, recorded with a simplified program and trimmed to contain only 4 frames. The first 2 frames are mainly initialization, and the third frame renders the scene from one angle without generating a histogram. The last frame renders the scene from another angle and generates a histogram at the bottom-right edge, which is completely blank on nouveau. However, if the trace is further trimmed to contain only frames 0, 1 and 3, the histogram renders correctly. (The results of both are attached below.)
I'm willing to provide assistance if needed.
Created attachment 129136
Wrong rendering (histogram blank)
Created attachment 129137
Correct rendering of the last frame
On GM107 it actually appears to work correctly. However on GK208 I got a blank histogram on the first run, and then a semi-random one on subsequent runs. (And now it's blank again.)
What GPU are you using?
(In reply to Ilia Mirkin from comment #3)
> On GM107 it actually appears to work correctly. However on GK208 I got a
> blank histogram on the first run, and then a semi-random one on subsequent
> runs. (And now it's blank again.)
> What GPU are you using?
Mine is a GT740M (GK208).
Can you figure out which compute invocation messes up? Looks like you have 2 compute shaders in there, although my bet is on the one that does all the atomic ops on shared memory - that stuff changed between Kepler and Maxwell, which could explain why it works on GM107.
Oh wait - while there are 2 compute shaders, you only use one of them (the one with all the shared atomics). Also need to double-check that it's not an earlier draw messing things up (e.g. that the compute shader is getting the proper inputs). But my bet is on an issue in the compute shader.
(In reply to Ilia Mirkin from comment #6)
> Oh wait - while there are 2 compute shaders, you only use one of them (the
> one with all the shared atomics). Also need to double-check that it's not an
> earlier draw messing things up (e.g. that the compute shader is getting the
> proper inputs). But my bet is on an issue in the compute shader.
I agree with you, since the same compute shader input was also used in subsequent rendering. If that were wrong, the following render would also be incorrect.
A missing OP_CCTL maybe? I have something which could help (but it's definitely not the right thing to do).
The possible fix is here:
That might explain the failure on Kepler (and presumably Fermi), because Maxwell doesn't need that.
(In reply to Samuel Pitoiset from comment #8)
> A missing OP_CCTL maybe? I have something which could help (but it's
> definitely not the right thing to do).
> The possible fix is here:
Nope. It does change the rendering very slightly - there's now one bar consistently in the histogram - still incorrect.
Created attachment 129157
The compute shader binary and assembly compiled with nouveau
I took some time today to capture the compiled shader binaries under nouveau and the proprietary driver. There seem to be differences in how atomic operations are handled. The code generated by the NVIDIA driver (attached below) is rather large, but it seems to consist mostly of repetitions of a few small sections of code.
I'm not familiar with the NVIDIA ISA, but I hope this helps.
Created attachment 129158
The compute shader binary and assembly compiled with proprietary driver
Now I doubt that the problem really lies in the atomic operations, or even in the compiler at all. I changed the compute shader slightly so that only thread (0, 0) accumulates its result into the SSBO (the dividing factor in the histogram fragment shader changed from 20000 to 20). This bypasses the atomic operations and shared memory, but the same problem persists.
I suspect that the input state of the compute shader is somehow invalid, which would explain why removing one frame from the trace magically eliminates the problem.
Created attachment 129380
A minimal reproducer
I wrote another, much simpler program today to find the real cause of this bug. It turns out that the compute shader is unable to sample from a depth texture.
The attached program first renders a shape to an FBO. Then a compute shader copies the color and depth textures attached to that FBO to a shader image, which is drawn to the screen. The depth part differs between nouveau and other drivers.
Compile the program with gcc test.c -lglfw -lGLEW -lGL
Created attachment 129381
Correct shader image (i965)
Created attachment 129382
Incorrect shader image (nouveau)
A few observations:
- Pretty sure you want GL_TEXTURE_FETCH_BARRIER_BIT. Not that this matters for nvc0.
- In addition to this always working on GM107, it *sometimes* works on my GK208. Something about card state? Or something in compute insufficiently initialized?
- I tried setting the DEPTH_TEXTURE bit in the TIC with no effect.
- The GM107 works just as effectively with both old- and new-style TIC formats.
- Throwing glFlush or glTextureBarrier into the mix made no difference, so it's not a cache issue.
- Switching to DEPTH24 and DEPTH32F made no difference.
Also likely relevant is the fact that all these fail, even on GM107. But the ones for other shader stages do pass. We're clearly missing some setting of some kind.
Author: Ilia Mirkin <email@example.com>
Date: Mon Feb 13 11:14:51 2017 -0500
nvc0: disable linked tsc mode in compute launch descriptor
Empirically, this makes things work. Presumably this was originally
copied from the blob, which does make use of linked tsc mode.
Signed-off-by: Ilia Mirkin <firstname.lastname@example.org>
Reviewed-by: Samuel Pitoiset <email@example.com>