The OpenGL ES 3.0 CTS: ES3-CTS.shaders.uniform_block.random.nested_structs_arrays.9 fails for BDW with Mesa 11.0.2 (git-51e0b06).
I have bisected the first problematic commit to:
Author: Connor Abbott <email@example.com>
Date: Tue Jun 30 13:38:20 2015 -0700
i965/fs: remove special case in setup_payload_interference()
regs_read() will handle LINTERP for us since the previous commit. In
addition, we were being too conservative, since it will only read 2
registers on SIMD8.
instructions in affected programs: 9061 -> 8893 (-1.85%)
All of the changes were due to spills being eliminated, mostly in KSP
Reviewed-by: Jordan Justen <firstname.lastname@example.org>
Signed-off-by: Connor Abbott <email@example.com>
With a revert of above commit the test pass. However, the revert cause regressions for other CTS tests. Also, the commit appear to be part of a cooperative patch-set from Jordan and Connor.
Potential relation to https://bugs.freedesktop.org/show_bug.cgi?id=92113 has not yet been proven.
Please note that on newer version of the Khronos CTS tests this test has been renamed:
The Fail is reproduced on Mesa: git@102f6
I did a little digging on this this afternoon. I'm going to write down what I found here before I head home. I don't know when I'll have time to poke at this again, so I want to leave notes for the next person.
I believe Connor's patch is a complete red herring; it only changes things because it causes register allocation to be subtly different on SIMD8. Running the test, I noticed that if you run it with INTEL_DEBUG=no16, it fails and, before the window closes, it flashes yellow. Looking at the shader it generates we see
> dEQP_FragColor = vec4(1.0, v_vtxResult, result, 1.0);
So a color of yellow means that v_vtxResult is fine, but result is 0. This means that the failure is in one of the comparisons in the fragment shader. In particular, the fault is not the vertex shader or nor is it in the interpolation of the vertex result. I see two potential ways forward:
1) Play with the dEQP test case and see if we can get a smaller failing case
2) Step through the fragment shader in the simulator and see if we can see where the false comparison is coming from.
My first attempt would probably be number 2 but, as I said above, I don't know when I'll have time to look at it again so if someone else wants to give it a try, feel free.
So, I did a little more investigation last night and Matt and I looked at it a bit this morning. Here's what we found:
1) If you step through it in the simulator, the simulator produces "correct" (would pass the test) results. I double and tripple-checked this and I can't get the simulator to give me yellow.
2) If you hack up could_coissue() in brw_fs_combine_constants.cpp, it generates the exact same code for HSW as it does for BDW but HSW passes.
3) We generate the exact same code for BSW as we do for BDW and BSW passes.
The above seems to rule out any sort of register allocation or scheduling errors. Instead, it's almost certain that we're hitting some undocumented and unsimulated hardware bug. Unfortunately, the shader (and hitting the bug) is very sensitive to register allocation and anything you do that changes register allocation is liable to make the shader not hit the bug.
This means that there are a pile of different things that one could do to make the shader go from pass to fail but none of them actually fix anything! We need to be very careful when we claim that a patch fixes this bug to make sure that the patch actually does fix something and doesn't just hide it again.
This bug has been fixed by:
Author: Connor Abbott <firstname.lastname@example.org>
AuthorDate: Sat Jun 6 13:32:21 2015 -0400
Commit: Connor Abbott <email@example.com>
CommitDate: Fri Oct 30 02:19:00 2015 -0400
i965: always run the post-RA scheduler
Before, we would only do scheduling after register allocation if we
spilled, despite the fact that the pre-RA scheduler was only supposed to
be for register pressure and set the latencies of every instruction to
1. This meant that unless we spilled, which we rarely do, then we never
considered instruction latencies at all, and we usually never bothered
to try and hide texture fetch latency. Although a later commit removes
the setting the latency to 1 part, we still want to always run the
post-RA scheduler since it's able to take the false dependencies that
the register allocator creates into account, and it can be more
aggressive than the pre-RA scheduler since it doesn't have to worry
about register pressure at all.
Test master post-ra-sched diff %diff
bench_OglPSBump2 396.730 402.386 5.656 +1.400%
bench_OglPSBump8 244.370 247.591 3.221 +1.300%
bench_OglPSPhong 241.117 242.002 0.885 +0.300%
bench_OglPSPom 59.555 59.725 0.170 +0.200%
bench_OglShMapPcf 86.149 102.346 16.197 +18.800%
bench_OglVSTangent 388.849 395.489 6.640 +1.700%
bench_trex 65.471 65.862 0.390 +0.500%
bench_trexoff 69.562 70.150 0.588 +0.800%
bench_heaven 25.179 25.254 0.074 +0.200%
Reviewed-by: Jason Ekstrand <firstname.lastname@example.org>
(In reply to Mark Janes from comment #4)
> This bug has been fixed by:
> Author: Connor Abbott <email@example.com>
> AuthorDate: Sat Jun 6 13:32:21 2015 -0400
> Commit: Connor Abbott <firstname.lastname@example.org>
> CommitDate: Fri Oct 30 02:19:00 2015 -0400
> i965: always run the post-RA scheduler
Not fixed, just hidden. It's a very obscure hardware bug and any sort of shuffling around of instructions is liable to hide it. However, it's interesting that running post-RA scheduling changes things because that means it has nothing to do with register allocation. I think this is more evidence of it being some sort of dependency-tracking bug in the hardware.
I'm re-opening this so we don't forget it's here.