Bug 99780 - Flickering artifacts in radeonsi driver with divergent texture reads.
Summary: Flickering artifacts in radeonsi driver with divergent texture reads.
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-12 15:34 UTC by Matias N. Goldberg
Modified: 2017-03-15 13:46 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Sample repro (2.26 MB, application/x-7z-compressed)
2017-02-12 15:34 UTC, Matias N. Goldberg
Details
Pixel Shader used to receive these shadows (already stripped down & with tint) (7.13 KB, text/plain)
2017-02-12 15:38 UTC, Matias N. Goldberg
Details
Vertex Shader used to receive these shadows (already stripped down & with tint) (3.43 KB, text/plain)
2017-02-12 15:39 UTC, Matias N. Goldberg
Details
IR Generated by LLVM 3.9.1 (7.40 KB, text/plain)
2017-02-12 16:19 UTC, Matias N. Goldberg
Details
IR Generated by LLVM 4.0 (9.10 KB, text/plain)
2017-02-12 16:20 UTC, Matias N. Goldberg
Details
ASM 4.0 (9.21 KB, text/plain)
2017-02-12 16:30 UTC, Matias N. Goldberg
Details
ASM from 3.9.1 (6.23 KB, text/plain)
2017-02-12 16:31 UTC, Matias N. Goldberg
Details
Mesa patch (3.61 KB, patch)
2017-02-16 16:24 UTC, Nicolai Hähnle
Details | Splinter Review

Description Matias N. Goldberg 2017-02-12 15:34:55 UTC
Created attachment 129530 [details]
Sample repro

I'm an Ogre3D developer.

We noticed after updating to LLVM 4.0 that there were visual glitches:
http://imgur.com/43tIuI6
http://imgur.com/iBhxoBj
(this glitches tend to flicker over time *even if everything is stationary*)
This bug is *NOT* present if using LLVM 3.9.1
This bug happens with radeon open source drivers. Using llvmpipe works as intended.

I tried to reproduce this bug in a clean environment but failed. So instead I stripped down as much as possible to a minimal repro sample out of our code; which I am attaching.

We noticed that this flicker happens alongside the difference between the PSSM splits, but not always:
http://imgur.com/130B4mE
We added a tint to differentiate the splits. But as you can see there are a few artifacts that happens within one tint; though the vast majority of the artifacts happens alongside the line.

A PSSM split in shader code means this:
uniform sampler2DShadow texShadowMap[2];
float fShadow = 1.0;
vec3 tint = vec3( 1, 1, 1 );
if( inPs.depth <= pass.pssmSplitPoints0 )
{
    fShadow = getShadow( texShadowMap[0], inPs.posL0, vec4( 0.000488281, 0.000488281, 1, 1 ) );
    tint = vec3( 0.0, 0, 1 );
}
else if( inPs.depth <= pass.pssmSplitPoints1 )
{
    fShadow = getShadow( texShadowMap[1], inPs.posL1, vec4( 0.000976562, 0.000976562, 1, 1 ) );
    tint = vec3(0.0, 1.0, 0.0 );
}

Everything from inPs comes from the vertex shader.

After stripping down as much as possible, our getShadow declaration is this:
float getShadow( sampler2DShadow shadowMap, vec4 psPosLN, vec4 invShadowMapSize )
{
	float fDepth = psPosLN.z;
	vec2 uv = psPosLN.xy / psPosLN.w;

	float retVal = 0;

	vec2 fW;
	vec4 c;
	
	retVal += texture( shadowMap, vec3( uv, fDepth ) ).r;

	return retVal;
}


I've seen a similar artifact before in the Windows AMD drivers back when I was experimenting with bindless textures and I purposedly managed to mix textures from divergent branches into the same wavefront. Not sure if that's what's happening here though.

Just run launchapp.sh; select OpenGL renderer, and click accept. It will launch the sample.

NOTE: If when running the sample you encounter this problem:
Created GL 4.3 context
libGL error: failed to create drawable
libGL error: failed to create drawable
X Error of failed request:  0
  Major opcode of failed request:  156 (GLX)
  Minor opcode of failed request:  26 (X_GLXMakeContextCurrent)
  Serial number of failed request:  28
  Current serial number in output stream:  28

The just run it again. I do not know why this happens, it happens rarely (like 1 in 10 runs) and I haven't yet been able to determine why. It sounds like a race condition inside Mesa (because the bug won't manifest while debugging where loading is significantly slowed down loading all symbols), but I don't have enough evidence to back this up, and it's not the reason of this bug ticket. 

If you have more issues running the sample let me know, I can provide assistance. This sample needs SDL2 to be installed in your system.

I'm using AMD Radeon HD 7770 1GB
Ubuntu 16.10
Mesa git 5bc222ebafddd14f2329f5096287b51d798a6431
LLVM 4.0 SVN 293947
Kernel 4.8.0-37-generic
Comment 1 Matias N. Goldberg 2017-02-12 15:38:52 UTC
Created attachment 129531 [details]
Pixel Shader used to receive these shadows (already stripped down & with tint)
Comment 2 Matias N. Goldberg 2017-02-12 15:39:18 UTC
Created attachment 129532 [details]
Vertex Shader used to receive these shadows (already stripped down & with tint)
Comment 3 Matias N. Goldberg 2017-02-12 16:19:47 UTC
Created attachment 129533 [details]
IR Generated by LLVM 3.9.1

Took me a while to find the IR out of all the noise.

You should consider adding a Mesa extension to give human-readable names to shaders. It would help everyone in debugging A LOT.

I don't mind having
glSetNameMESA( shaderId, "My Filename" );
and internally Mesa keeps a std::map<int, string> or similar.
Comment 4 Matias N. Goldberg 2017-02-12 16:20:14 UTC
Created attachment 129534 [details]
IR Generated by LLVM 4.0
Comment 5 Matias N. Goldberg 2017-02-12 16:30:31 UTC
Created attachment 129535 [details]
ASM 4.0

Those readfirstlane instructions are incredibly suspicious
Comment 6 Matias N. Goldberg 2017-02-12 16:31:07 UTC
Created attachment 129536 [details]
ASM from 3.9.1
Comment 7 Matias N. Goldberg 2017-02-12 17:34:21 UTC
Analyzing the ASM it's clear what's happening: the 3.9.1 version branches everything.
The 4.0 version branches the calculations, then at the end uses readfirstlane (this is WRONG) and finally performs one sample.
Comment 8 Bas Nieuwenhuizen 2017-02-12 17:54:28 UTC
I think this is the issue described in 

https://reviews.llvm.org/D26348

Not idea when the fix is going to go through though.
Comment 9 Matias N. Goldberg 2017-02-12 23:42:58 UTC
Thanks for the referral.

I commented there, and they replied back with a potential fix:

> I believe it is very easy to fix, no need to change LLVM: marking the texture operation as having unknown side-effect should be enough.

Any comments? (perhaps you already know this but the performance hit could be huge?)
Comment 10 Matias N. Goldberg 2017-02-15 15:21:52 UTC
I've built LLVM 5.0 from latest SVN.
I've applied the patch from https://reviews.llvm.org/D26348

The issue still persists. I don't know if anything else was needed (e.g. changes to radeonsi to make use of the patch?).
Comment 11 Nicolai Hähnle 2017-02-16 16:24:55 UTC
Created attachment 129675 [details] [review]
Mesa patch

Yes, you need this patch for Mesa as well.

I'm going to pick up the LLVM patch again as well these days.
Comment 12 Matias N. Goldberg 2017-02-16 18:47:31 UTC
I can confirm applying the patch you uploaded (plus LLVM's patch of course) fixes the problem.

Thank you.
Comment 13 Samuel Pitoiset 2017-03-15 13:46:05 UTC
Just pushed a workaround for this issue in Mesa.

https://cgit.freedesktop.org/mesa/mesa/commit/?id=7751ed39e40e08e5aa0633d018c9f25ad17f9bb0


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.