Bug 93594

Summary: Flickering Shadows in The Talos Principle
Product: Mesa Reporter: Christoph Haag <haagch>
Component: Drivers/Gallium/radeonsiAssignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact: Default DRI bug account <dri-devel>
Severity: normal    
Priority: medium CC: alexander, EoD, maldela, wade
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments: WholeQuadMode pass patch for LLVM

Description Christoph Haag 2016-01-05 15:16:56 UTC
I haven't seen specifically this bug mentioned here yet. Apparently this has been an issue for some time and with latest mesa git and llvm svn 256490 it still happens.

Video demonstrating the flickering (ignore the long pause): https://www.youtube.com/watch?v=UyZu_71LgGE

My hardware:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Wimbledon XT [Radeon HD 7970M] (rev ff)

Here is a trace (657MB uncompressed, 409MB compressed): http://haagch.frickel.club/files/talos.trace.xz

Replaying the trace on radeonsi shows flickering shadows.

When replaying the trace on intel, the shadows don't seem to flicker.
Comment 1 Michel Dänzer 2016-01-06 03:01:14 UTC
(In reply to Christoph Haag from comment #0)
> When replaying the trace on intel, the shadows don't seem to flicker.

Not at all? I see pretty bad flickering on my Verde, but while I can still see some on my Kaveri, it's hardly noticeable. So this might be at least partly an SI specific issue.
Comment 2 Christoph Haag 2016-01-06 18:59:36 UTC
I'm not completely sure. The performance is very bad (at the same settings) and on intel there is another bug that makes the whole scene go dark at random perspectives. Here is a video how the game looks with the same settings on intel:
https://www.youtube.com/watch?v=7ZcqdLQupDc
I think the shadows are "vanishing" in and out a bit too, but the intense flicker like on radeonsi isn't there.
Comment 3 Ilia Mirkin 2016-01-06 19:08:20 UTC
FTR you may be interested in bug 90513 -- long story short, semi-similar (but based on the video, not identical) symptoms on nouveau/nvc0.
Comment 4 EoD 2016-01-07 08:58:55 UTC
I rendered the apitrace on llvmpipe. If there is still flickering it is much less compared to the radeonsi replay: https://www.youtube.com/watch?v=InHA4RAVvfM
Comment 5 EoD 2016-01-07 23:45:30 UTC
(In reply to Michel Dänzer from comment #1)
> (In reply to Christoph Haag from comment #0)
> > When replaying the trace on intel, the shadows don't seem to flicker.
> 
> Not at all? I see pretty bad flickering on my Verde, but while I can still
> see some on my Kaveri, it's hardly noticeable. So this might be at least
> partly an SI specific issue.

While changing a few settings I observed, the lower the "Max Shadow Size" setting is, the easier you can see the heavy flickering.
When you set the setting "No dynamic lightning", all the flicker is gone.

I made a new and smaller apitrace with every setting on LOWEST. The new apitrace shows the difference between a scene with dynamic lightning (no flickering) and the same scene without dynamic lightning (massive flickering).

https://drive.google.com/file/d/0B42CQY2wGHumalYyWEJNRXAzWFE/view?usp=sharing

It evens runs semi-smoothly on llvmpipe.
Comment 6 EoD 2016-01-07 23:50:16 UTC
(In reply to EoD from comment #5)
> I made a new and smaller apitrace with every setting on LOWEST. The new
> apitrace shows the difference between a scene with dynamic lightning (no
> flickering) and the same scene without dynamic lightning (massive
> flickering).
> 
> https://drive.google.com/file/d/0B42CQY2wGHumalYyWEJNRXAzWFE/view?usp=sharing

Correction. It should be called:

The new apitrace shows the difference between a scene without dynamic lightning (no flickering) and the same scene with dynamic lightning (massive flickering).

The setting is called "No Dynamic Lightning" (On = Good, Off = Flickering)
Comment 7 EoD 2016-02-14 04:21:08 UTC
FTR the apitrace replays fine with r600 on my Barts.
Comment 8 Marek Olšák 2016-02-17 01:17:50 UTC
I've captured one of the problematic draw calls, which used this fragment shader:

FRAG
DCL IN[0], GENERIC[9], PERSPECTIVE
DCL OUT[0], COLOR
DCL SAMP[0]
DCL SVIEW[0], 2D, FLOAT
DCL CONST[0..13]
DCL CONST[15]
DCL TEMP[0..3], LOCAL
IMM[0] FLT32 {    1.0000,     0.0000,     0.2500,     4.0000}
  0: MOV TEMP[0].xy, IN[0].xyyy
  1: TEX TEMP[0].w, TEMP[0], SAMP[0], 2D
  2: MUL TEMP[0].x, TEMP[0].wwww, IN[0].wwww
  3: MAD TEMP[1].x, TEMP[0].xxxx, CONST[1].xxxx, CONST[1].yyyy
  4: MOV_SAT TEMP[1].x, TEMP[1].xxxx
  5: MOV TEMP[0].w, TEMP[1].xxxx
  6: FSLT TEMP[1].x, TEMP[1].xxxx, CONST[0].xxxx
  7: AND TEMP[1].x, TEMP[1].xxxx, IMM[0].xxxx
  8: KILL_IF -TEMP[1].xxxx
  9: MAD TEMP[0].x, IN[0].zzzz, CONST[0].zzzz, CONST[0].yyyy
 10: MOV_SAT TEMP[1].x, TEMP[0].xxxx
 11: DDX TEMP[2].x, TEMP[1].xxxx
 12: ABS TEMP[2].x, TEMP[2].xxxx
 13: MUL TEMP[3], CONST[15].xxxx, TEMP[1].xxxx
 14: DDY TEMP[3].x, TEMP[3]
 15: ABS TEMP[3].x, TEMP[3].xxxx
 16: ADD TEMP[2].x, TEMP[2].xxxx, TEMP[3].xxxx
 17: MAD TEMP[0].x, TEMP[2].xxxx, IMM[0].zzzz, TEMP[1].xxxx
 18: MUL TEMP[1].x, TEMP[0].xxxx, TEMP[0].xxxx
 19: ADD TEMP[1].x, TEMP[0].xxxx, -TEMP[1].xxxx
 20: MUL TEMP[1].x, IMM[0].wwww, TEMP[1].xxxx
 21: ADD TEMP[1].x, IMM[0].xxxx, -TEMP[1].xxxx
 22: MOV TEMP[0].y, TEMP[1].xxxx
 23: MOV TEMP[0].z, IMM[0].yyyy
 24: MOV OUT[0], TEMP[0]
 25: END

If KILL_IF masks out some but not all invocations in a 2x2 quad, the subsequent DDX and DDY opcodes can result in undefined values, resulting in garbage on the output.

The simple solution is to move KILL_IF to the end of the shader. I've verified that it works, but it's inefficient. The best solution would be to:
- save the resulting EXEC mask after KILL_IF
- use S_WQM on the exec mask to get a whole-quad mask
- execute DDX and DDY
- restore the EXEC mask (this must be done after both DDX & DDY but before PS exports)
Comment 9 Michel Dänzer 2016-02-17 01:55:01 UTC
(In reply to Marek Olšák from comment #8)
> If KILL_IF masks out some but not all invocations in a 2x2 quad, the
> subsequent DDX and DDY opcodes can result in undefined values, resulting in
> garbage on the output.

I think Nicolai is making some improvements for the EXEC mask handling for shader images. Maybe that'll help for this as well.

Note that AFAIK using things like derivatives in non-uniform control flow isn't supported by GLSL. What's the original GLSL shader?
Comment 10 Marek Olšák 2016-02-17 16:51:46 UTC
(In reply to Michel Dänzer from comment #9)
> Note that AFAIK using things like derivatives in non-uniform control flow
> isn't supported by GLSL. What's the original GLSL shader?

True.

The GLSL shader is using discard followed by fwidth. This is undefined behavior.

Therefore, it's an application bug.

The easy workaround would be to disable register allocation in st/mesa to get a quasi-SSA form and trivially move KILL_IF to the end of the shader if the app is detected to be Talos Principle.
Comment 11 Marek Olšák 2016-02-17 17:21:50 UTC
Also, I think we shouldn't do anything in the shader compiler. A simple workaround in Mesa should be enough.
Comment 12 Roland Scheidegger 2016-02-17 21:12:29 UTC
(In reply to Marek Olšák from comment #10)
> The GLSL shader is using discard followed by fwidth. This is undefined
> behavior.
> 
> Therefore, it's an application bug.

I wasn't aware of this difference, but seeing this bug made me suspicious, and indeed d3d10 is saying for discard: https://msdn.microsoft.com/en-us/library/windows/desktop/hh446968%28v=vs.85%29.aspx
"This instruction flags the current pixel as terminated, while continuing execution, so that other pixels executing in parallel may obtain derivatives if necessary. Even though execution continues, all Pixel Shader output writes before or after the discard instruction are discarded."

That's very interesting...

gallium docs don't actually say anything about this neither naturally. (I think it should work in llvmpipe, because indeed we only update the pixel alive mask.)

Due to d3d10 being different there, I wouldn't be surprised if other apps make the same mistake.
Comment 13 Marek Olšák 2016-02-17 23:20:09 UTC
(In reply to Roland Scheidegger from comment #12)
> Due to d3d10 being different there, I wouldn't be surprised if other apps
> make the same mistake.

For performance, it's better to execute KILL as soon as possible, because it allows us to skip texture loads and possibly even input loads.
Comment 14 Roland Scheidegger 2016-02-18 01:17:29 UTC
(In reply to Marek Olšák from comment #13)
> (In reply to Roland Scheidegger from comment #12)
> > Due to d3d10 being different there, I wouldn't be surprised if other apps
> > make the same mistake.
> 
> For performance, it's better to execute KILL as soon as possible, because it
> allows us to skip texture loads and possibly even input loads.

Certainly, which is probably why it is undefined behavior in glsl.
My guess is the d3d10 drivers will check dependency chains and see if the result may be needed later for derivatives, and if it is only kill whole quads immediately or something like that. Whereas in GL it would be the responsibility of the shader to do the actual kill only when the pixels aren't needed any longer for calculating derivatives (though I suppose that means it potentially wouldn't early discard whole quads neither).
Comment 15 Nicolai Hähnle 2016-02-18 15:07:21 UTC
*** Bug 93591 has been marked as a duplicate of this bug. ***
Comment 16 Nicolai Hähnle 2016-02-18 15:25:02 UTC
Interesting information about D3D10, thank you.

The GLSL 4.50 spec says:

"The discard keyword is only allowed within fragment shaders. It can be used within a fragment shader to abandon the operation on the current fragment. This keyword causes the fragment to be discarded and no updates to any buffers will occur. Control flow exits the shader, and subsequent implicit or explicit derivatives are undefined when this exit is non-uniform."

One annoying aspect of this language is that one can reasonably read it as non-uniformity only being relevant for non-helper fragments. If a pixel quad is partial covered by the original primitive, and discard is used in a way that keeps the covered pixels but discard the helper ones, should derivatives be defined or not?

As Michel said, I am indeed currently working on a patch changing exec mask behavior in LLVM for stores and atomics in pixel shaders. While what I have so far does not fix this bug yet, it already requires switching back and forth between WQM and non-WQM/"exact" modes. Extending this to keep full quads alive after KILL_IF should not add much more overhead.
Comment 17 Roland Scheidegger 2016-02-18 16:06:17 UTC
(In reply to Nicolai Hähnle from comment #16)
> 
> One annoying aspect of this language is that one can reasonably read it as
> non-uniformity only being relevant for non-helper fragments. If a pixel quad
> is partial covered by the original primitive, and discard is used in a way
> that keeps the covered pixels but discard the helper ones, should
> derivatives be defined or not?

That's a good question... My interpretation would be that derivatives should be undefined in this case if only because otherwise things get even more complex...
Comment 18 Kai 2016-02-19 10:17:18 UTC
(In reply to Marek Olšák from comment #10)
> (In reply to Michel Dänzer from comment #9)
> > Note that AFAIK using things like derivatives in non-uniform control flow
> > isn't supported by GLSL. What's the original GLSL shader?
> 
> True.
> 
> The GLSL shader is using discard followed by fwidth. This is undefined
> behavior.
> 
> Therefore, it's an application bug.
> 
> The easy workaround would be to disable register allocation in st/mesa to
> get a quasi-SSA form and trivially move KILL_IF to the end of the shader if
> the app is detected to be Talos Principle.

I forwarded this information to Croteam and Dean Sekulic answered:
> Oh, just found out that discard instruction is treated as dynamic flow control
> for some reason... :/
>
> Will fix this!
>
>
> On 19-02-16 8:21, Dean Sekulic wrote:
>> I can't find any reference anywhere that doing fwidth() (or ddx/ddy for
>> that matter) after discard should be undefined behavior... :/
>>
>> GPUs should handle that case completely fine.
>>
>> Can you elaborate, please?

Looks like there might be an update for The Talos Principle which will address this.
Comment 19 Kai 2016-02-19 16:41:44 UTC
(In reply to Kai from comment #18)
> Looks like there might be an update for The Talos Principle which will
> address this.

Short update: Dean Sekulic of Croteam wrote me, that the fix on their end has been implemented and will be part of the next update.
Comment 20 Christoph Haag 2016-02-19 20:05:25 UTC
Doesn't happen in public_beta anymore.
Comment 21 Karol Herbst 2016-02-20 00:34:15 UTC
it still happens for me though on high, high, ultra settings. Didn't check other settings.
Comment 22 EoD 2016-02-20 10:17:15 UTC
(In reply to Karol Herbst from comment #21)
> it still happens for me though on high, high, ultra settings. Didn't check
> other settings.

I can confirm this. There is still a little flickering shadows on my Tonga with GPU Speed High. But this is almost negligible compared to before.
Comment 23 Nicolai Hähnle 2016-02-20 21:22:33 UTC
Created attachment 121861 [details] [review]
WholeQuadMode pass patch for LLVM

Good to see CroTeam being responsive on this.

Does the attached patch for LLVM make a difference for the remaining flicker? You can also try my branch from here: https://cgit.freedesktop.org/~nh/llvm/log/?h=images
Comment 24 Christoph Haag 2016-02-22 01:00:25 UTC
Well, not sure what exactly we are talking about. The quick flickering that made it completely unplaybable is gone.

I'm not sure how the shadows are supposed to look like. I tried https://cgit.freedesktop.org/~nh/llvm/log/?h=images and that's what it looks like
https://www.youtube.com/watch?v=DzkV_BHXzuU
(damn, gstreamer recording + vaapi encoding is stuttering a lot again)
Comment 25 Marek Olšák 2016-02-23 00:06:35 UTC
(In reply to Christoph Haag from comment #24)
> Well, not sure what exactly we are talking about. The quick flickering that
> made it completely unplaybable is gone.
> 
> I'm not sure how the shadows are supposed to look like. I tried
> https://cgit.freedesktop.org/~nh/llvm/log/?h=images and that's what it looks
> like
> https://www.youtube.com/watch?v=DzkV_BHXzuU
> (damn, gstreamer recording + vaapi encoding is stuttering a lot again)

Looks good. What fixed it? The LLVM patch or game update?
Comment 26 EoD 2016-02-25 21:46:30 UTC
(In reply to Marek Olšák from comment #25)
> (In reply to Christoph Haag from comment #24)
> > Well, not sure what exactly we are talking about. The quick flickering that
> > made it completely unplaybable is gone.
> > 
> > I'm not sure how the shadows are supposed to look like. I tried
> > https://cgit.freedesktop.org/~nh/llvm/log/?h=images and that's what it looks
> > like
> > https://www.youtube.com/watch?v=DzkV_BHXzuU
> > (damn, gstreamer recording + vaapi encoding is stuttering a lot again)
> 
> Looks good. What fixed it? The LLVM patch or game update?

The video shows from Christoph Haag shows those "minor flickerings" we were talking about. Those flickers also are in the D3D11 and Vulkan version of the game, so it's not an issue on mesa's side.

For me, the flickering disappeared after updating the game, not llvm. Do you want us to test the old game version with those llvm changes?
Comment 27 Marek Olšák 2016-02-26 10:34:23 UTC
(In reply to EoD from comment #26)
> (In reply to Marek Olšák from comment #25)
> > (In reply to Christoph Haag from comment #24)
> > > Well, not sure what exactly we are talking about. The quick flickering that
> > > made it completely unplaybable is gone.
> > > 
> > > I'm not sure how the shadows are supposed to look like. I tried
> > > https://cgit.freedesktop.org/~nh/llvm/log/?h=images and that's what it looks
> > > like
> > > https://www.youtube.com/watch?v=DzkV_BHXzuU
> > > (damn, gstreamer recording + vaapi encoding is stuttering a lot again)
> > 
> > Looks good. What fixed it? The LLVM patch or game update?
> 
> The video shows from Christoph Haag shows those "minor flickerings" we were
> talking about. Those flickers also are in the D3D11 and Vulkan version of
> the game, so it's not an issue on mesa's side.

It's not an issue. It's normal behavior. It's a trait of the shadow mapping technique and it happens when the shadow map isn't large enough. AFAIK, shadow mapping is the best way to do soft shadows, but the technique alone can create a lot of artifacts and what you've seen there aren't even the worst ones. Game developers usually spend as much time implementing shadow mapping as they spend on attempts to hide its flaws. See the images here for some extreme examples of what is "normal" with the technique:
https://takinginitiative.wordpress.com/2011/05/25/directx10-tutorial-10-shadow-mapping-part-2/

> 
> For me, the flickering disappeared after updating the game, not llvm. Do you
> want us to test the old game version with those llvm changes?

No, we can close this as fixed by the game developer. Thanks.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.