Bug 111141

Summary:	[REGRESSION] [BISECTED] [DXVK] 1-bit booleans and Elite Dangerous shader mis-optimization
Product:	Mesa	Reporter:	Steven Newbury <s_j_newbury>
Component:	Drivers/Vulkan/radeon	Assignee:	mesa-dev
Status:	RESOLVED MOVED	QA Contact:	mesa-dev
Severity:	normal
Priority:	medium
Version:	19.0
Hardware:	All
OS:	All
Whiteboard:
i915 platform:		i915 features:
Attachments:	After commit 3b3081479163475f25b908008250d83c31716c34; Before commit 3b3081479163475f25b908008250d83c31716c34; RenderDoc capture after commit; RenderDoc capture before commit; intel_hd620_picture

Description Steven Newbury 2019-07-15 23:02:52 UTC

I've bisected a shader induced corruption issue to:

3b3081479163475f25b908008250d83c31716c34
nir/algebraic: Optimize 1-bit Booleans 

Everything renders normally prior to this commit; afterwards, some textures appear "posterized" and ambiently lit areas are highlighted in the wrong colours.

I'll attach screenshots tomorrow.

I will provide any information on request/direction.

Comment 1 Steven Newbury 2019-07-16 07:16:51 UTC

Created attachment 144793 [details]
After commit 3b3081479163475f25b908008250d83c31716c34

Comment 2 Steven Newbury 2019-07-16 07:23:13 UTC

Created attachment 144794 [details]
Before commit 3b3081479163475f25b908008250d83c31716c34

Comment 3 Steven Newbury 2019-07-16 12:42:56 UTC

Created attachment 144801 [details]
RenderDoc capture after commit

Comment 4 Steven Newbury 2019-07-16 13:04:32 UTC

Created attachment 144802 [details]
RenderDoc capture before commit

Comment 5 Jason Ekstrand 2019-07-16 13:41:13 UTC

Could you also please say what GPU you're seeing the corruption on? It might matter and it's certainly needed in order for the RenderDoc traces to be of any use.

Comment 6 Steven Newbury 2019-07-16 13:56:32 UTC

GPU is a POLARIS10 (RX470)

Comment 7 Denis 2019-07-17 14:57:19 UTC

Created attachment 144810 [details]
intel_hd620_picture

Hi, it looks like this issue does not occur on Intel GPUs. Tested on HD 620 (KBL). Picture attached (my in-game settings are "low").

Comment 8 Steven Newbury 2019-07-18 08:52:45 UTC

So presumably it's the optimization for AMD?

I had a good look through the code but I'm not sufficiently clear as to how it all works to really know where the bug might be.

My current understanding, please correct me if I'm wrong:

The game is shipped with HLSL shaders compiled to DXBC

DXVK converts those DXBC -> SPIR-V [D3D int32_t booleans are converted to SPIR-V boolean type] 

(At this point all must be okay since it worked before and still works with Intel, except that Intel has a different internal representation...)

SPIR-V -> NIR [SPIR-V booleans are converted to int1_t]

NIR -> GPU HW Shader [AMD scalar booleans; Intel 1+31bit booleans]

(What happens if the booleans are part of a struct and the code assumes they're 32-bit during the above passes?  Previously, NIR used D3D compatible booleans so it would just work, Intel is 32bit so maybe it all falls back into place?)

Is the above nonsense?
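
To make the concern about mixed boolean widths concrete, here is a minimal Python sketch. It is not code from DXVK, NIR, or RADV; every name in it is hypothetical, and it assumes the 32-bit convention is 0 for false and all ones for true. It only shows why a value in the 1-bit convention cannot be fed to code that expects a 32-bit mask without an explicit conversion.

```python
# Hypothetical illustration of two boolean conventions; not Mesa code.
MASK32 = 0xFFFFFFFF


def d3d_bool(cond: bool) -> int:
    """32-bit boolean (assumed convention): all bits set for true, 0 for false."""
    return MASK32 if cond else 0


def one_bit_bool(cond: bool) -> int:
    """1-bit boolean: only the values 0 and 1 exist."""
    return 1 if cond else 0


def select_by_mask(mask32: int, a: int, b: int) -> int:
    """Bitwise select; only correct when mask32 is 0 or 0xFFFFFFFF."""
    return ((a & mask32) | (b & ~mask32)) & MASK32


def widen_1bit_to_32(b1: int) -> int:
    """Explicitly convert a 1-bit boolean to the 0 / all-ones convention."""
    return (-b1) & MASK32


a, b = 0x12345678, 0xABCDEF00
# Correct: the mask really is all ones, so a is selected.
right = select_by_mask(d3d_bool(True), a, b)
# Wrong: a 1-bit "true" used as a mask keeps only bit 0 of a.
wrong = select_by_mask(one_bit_bool(True), a, b)
# Correct again once the 1-bit value is widened first.
fixed = select_by_mask(widen_1bit_to_32(one_bit_bool(True)), a, b)
```

Whether anything like the "wrong" case actually happens in this bug is exactly the open question; the sketch only illustrates the failure mode being asked about.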

Comment 9 Jason Ekstrand 2019-07-19 18:19:22 UTC

It's not nonsense at all.  I could see a few possible explanations of the failure on AMD but not Intel:

 1. The shader depends on undefined behavior and the optimization change caused that undefined behavior to be different and Intel just happens to work by luck.

 2. The optimization change caused NIR to turn a valid shader into something which depends on undefined behavior and Intel just happens to work by luck.

 3. The optimization change caused NIR to generate a slightly different shader which exposed a bug in the AMD LLVM back-end or the NIR -> LLVM translator.

I'd give the three even odds at this point.  I just finished looking quite a bit at that shader in the context of https://bugs.freedesktop.org/show_bug.cgi?id=111152 and it has 23 undefined values at the top of the shader.  Let me look at it a bit and see where some of them are used.  They may be harmless.

Comment 10 Jason Ekstrand 2019-07-19 18:39:38 UTC

The undefined values appear harmless so I'm going to guess that this is probably actually a RADV bug.  Not knowing too much about RADV, how I'd go about debugging it next would be to try "bisecting" nir_opt_algebraic.py by commenting out large chunks of it and seeing if you can narrow down which optimization triggers the problem.  Fortunately, with a renderdoc trace, it's pretty quick to test out changes (before/after shouldn't matter for the renderdoc capture in this case) so it shouldn't take a terribly long time.  Once you've figured out which optimization is the culprit, we can see if that optimization is buggy for some reason.  If the NIR optimization looks sound, you can try dumping the shaders out of RADV (again, I have no idea how to do that), diff them, and see if you can figure out why it's a problem.

For the moment, I'm going to move it over to RADV so that those people get notified.
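
The "bisecting" idea above can be sketched as a plain binary search over a list of rules. This is only an illustration: `bisect_rules` and `is_broken` are hypothetical names, and in practice `is_broken` stands for the manual step of commenting out everything else, rebuilding Mesa, and replaying the trace. The sketch assumes there is exactly one culprit and that the corruption shows up whenever that rule is enabled, which may not hold in reality.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def bisect_rules(rules: Sequence[T],
                 is_broken: Callable[[Sequence[T]], bool]) -> T:
    """Return the single rule whose presence makes is_broken() true.

    Assumes exactly one culprit exists in `rules` and that the bug
    reproduces whenever that rule is among the enabled ones.
    """
    if not rules:
        raise ValueError("need at least one rule to bisect")
    lo, hi = 0, len(rules)  # the culprit's index lies in [lo, hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Enable only the first half of the remaining candidates.
        if is_broken(rules[lo:mid]):
            hi = mid  # culprit is in the enabled half
        else:
            lo = mid  # culprit is in the disabled half
    return rules[lo]


# Toy usage: ten placeholder rules, with "r6" standing in for the culprit.
rules = [f"r{i}" for i in range(10)]
culprit = bisect_rules(rules, lambda enabled: "r6" in enabled)
```

With N rules this needs about log2(N) rebuild-and-replay cycles instead of N, which is why commenting out large chunks first is worthwhile.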

Comment 11 Steven Newbury 2019-07-20 10:27:07 UTC

(In reply to Jason Ekstrand from comment #9)

>  3. The optimization change caused NIR to generate a slightly different
> shader which exposed a bug in the AMD LLVM back-end or the NIR -> LLVM
> translator.
> 
I did try it with ACO too - which resulted in exactly the same output, so that suggests it isn't the LLVM back-end unless it's a fallback operation..?

Comment 12 Connor Abbott 2019-07-22 15:13:16 UTC

It seems I get an error trying to download your capture, probably because it's too big, can you upload it somewhere else?

Also, radv sometimes (intentionally or not) has a slightly different pass ordering or lowers things differently, which can make a NIR-level bug still only appear on radv.

Comment 13 Steven Newbury 2019-07-22 16:46:56 UTC

I've put it on my local web server:
http://www.snewbury.org.uk/before.rdc
http://www.snewbury.org.uk/after.rdc

Comment 14 Connor Abbott 2019-07-23 10:00:02 UTC

I tried this on my Polaris10 card (RX 580) and I couldn't see the corruption with either the commit you mentioned + LLVM 8.0, or a recent Mesa master + LLVM master. The before trace wouldn't render with an "Error   - Unrecognised section type 'c0'", but the after trace didn't have the corruption (btw, you don't need to record two separate traces -- the trace only has the game's rendering commands, and not the final output, so if there's a bug in the driver it'll be recreated when replaying the trace with the buggy driver). Does the corruption still occur for you when you replay your trace under renderdoc with a recent Mesa?

Comment 15 Steven Newbury 2019-07-23 12:15:19 UTC

I've tried recent versions, and compiled with just "-O2", every version since the commit behaves that way for me.  I'm going to try to rebuild llvm with -O2, perhaps llvm is getting subtly miscompiled.  It's strange that everything else I've tried has worked fine though!

Are there any other dependencies which might affect shader compilation/rendering that I might try to rebuild?

(I'm on Gentoo so trying different versions or compiler flags isn't an issue)

Comment 16 Steven Newbury 2019-07-23 17:19:31 UTC

Replaying the trace with a recent Mesa causes my GPU to crash in such a way that it requires a reboot.  Is that expected to work?

I've rebuilt mesa + llvm + xorg-server git master using gcc-9.1 and C(XX)FLAGS=-O2 and I still get the same output.  Previously, I was building llvm with clang.  Proton/DXVK is built using "-O2 -march=native", also with gcc-9.1.

Is it a bug in my GPU?

Comment 17 Connor Abbott 2019-07-25 09:17:56 UTC

No, crashing when replaying is definitely not expected, although the result of some bug could definitely be a GPU hang. It's really strange, though, since I can replay it just fine on my machine with a similar card. Can you get the dmesg from before it crashes?

One other thing you can try is to build mesa with -Dbuildtype=debug (i.e. with assertions enabled and no optimizations) and see if there's an assertion fail somewhere, or if it magically fixes itself.

The only other thing I can think of would be to replay the trace with "RADV_DEBUG=shaders renderdoccmd replay ..." and uploading the output so I can diff it. I don't know if renderdoc compiles shaders in parallel, so you might need to force it to use one thread with e.g. numactl in order to get a consistent output.
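
For the diffing step, a minimal Python sketch using the standard `difflib` module is shown below. It assumes the output of the two replays has already been saved as text; the function name and the labels `before.txt` and `after.txt` are hypothetical. It also assumes both dumps list shaders in the same order, which is why forcing a single compile thread was suggested above.

```python
import difflib
from typing import List


def diff_shader_dumps(before: str, after: str, context: int = 3) -> List[str]:
    """Return a unified diff of two shader dump texts as a list of lines.

    An empty list means the two dumps are identical.
    """
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before.txt",  # hypothetical labels, used only in the header
            tofile="after.txt",
            n=context,
            lineterm="",
        )
    )


# Toy usage with placeholder text instead of real shader dumps.
sample_diff = diff_shader_dumps("a\nb\nc", "a\nB\nc")
```

For real dumps, the two strings could be read with `pathlib.Path(...).read_text()` and the resulting lines joined with newlines before attaching them to the bug.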

Comment 18 Steven Newbury 2019-07-31 17:51:31 UTC

(In reply to Connor Abbott from comment #17)

> One other thing you can try is to build mesa with -Dbuildtype=debug (i.e.
> with assertions enabled and no optimizations) and see if there's an
> assertion fail somewhere, or if it magically fixes itself.
> 
I'll try this first since it's the easiest.

It is perplexing what could possibly be causing my system to act differently, especially since it didn't demonstrate anything odd prior to the boolean change.

Comment 19 Steven Newbury 2019-08-01 16:02:00 UTC

(In reply to Steven Newbury from comment #18)
> (In reply to Connor Abbott from comment #17)
> 
> > One other thing you can try is to build mesa with -Dbuildtype=debug (i.e.
> > with assertions enabled and no optimizations) and see if there's an
> > assertion fail somewhere, or if it magically fixes itself.
> > 
> I'll try this first since it's the easiest.
> 
> It is perplexing what could possibly be causing my system to act
> differently, especially since it didn't demonstrate anything odd prior to
> the boolean change.

Compiling latest llvm/mesa with debug gives the same visual result but emits continuous "../mesa-9999/src/amd/vulkan/radv_descriptor_set.c:496: VK_ERROR_OUT_OF_POOL_MEMORY"

... probably not related.

Comment 20 Steven Newbury 2019-08-01 16:28:53 UTC

(In reply to Jason Ekstrand from comment #10)
> The undefined values appear harmless so I'm going to guess that this is
> probably actually a RADV bug.  Not knowing too much about RADV, how I'd go
> about debugging it next would be to try "bisecting" nir_opt_algebraic.py by
> commenting out large chunks of it and seeing if you can narrow down which
> optimization triggers the problem.  Fortunately, with a renderdoc trace,
> it's pretty quick to test out changes (before/after shouldn't matter for the
> renderdoc capture in this case) so it shouldn't take a terribly long time. 
> Once you've figured out which optimization is the culprit, we can see if
> that optimization is buggy for some reason.  If the NIR optimization looks
> sound, you can try dumping the shaders out of RADV (again, I have no idea
> how to do that), diff them, and see if you can figure out why it's a problem.
> 
> For the moment, I'm going to move it over to RADV so that those people get
> notified.



Commenting out the bcsel@32 optimizations makes it work.

I'll try enabling each one now...

Comment 21 Steven Newbury 2019-08-01 20:06:02 UTC

The first one alone is enough to trigger the behaviour. It just crashes with the first disabled and the others enabled.

Comment 22 Steven Newbury 2019-08-02 10:44:25 UTC

Essentially reverting 3371de38f282c77461bbe5007a2fec2a975776df makes it work...  

...why?

Comment 23 Adam Jackson 2019-09-18 20:10:32 UTC

https://gitlab.freedesktop.org/mesa/mesa/issues/867

Comment 24 Samuel Pitoiset 2019-10-30 16:44:26 UTC

Are you still able to reproduce this problem?
