Created attachment 142859 [details] [review]
Patch to narrow the CTS test to a smaller reproducer
I believe I've found a bug with MCS and slow clears.
If I disable fast clears (hack brw_blorp.c to add "can_fast_clear = false"), then KHR-GL43.compute_shader.resource-texture fails on my Kabylake system. The test clears a 4x MSAA 4x16 R32G32B32A32_FLOAT texture, which is using the CMS layout. Disabling CMS/MCS also fixes it. (There are actually two, one is a single slice, the other is a 2-slice array texture. Both are broken.) It then samples the textures with a compute shader.
Fast clears ought to be optional - I think they may be masking a bug.
The attached patch to the CTS test narrows it down to a much smaller test case.
This is fairly outside of my expertise, so I'd appreciate it if some MCS experts could take a look (without ruining your holiday plans)...thanks!
Sagar also noted that hacking intel_miptree_alloc_aux to set memset_value = 0 instead of memset_value = 0xFF also makes the test pass. (I also discovered 1/2/3 work, so basically any slice - just not "clear"...)
That was pretty surprising to me.
(In reply to Kenneth Graunke from comment #1)
> Sagar also noted that hacking intel_miptree_alloc_aux to set memset_value =
> 0 instead of memset_value = 0xFF also makes the test pass. (I also
> discovered 1/2/3 work, so basically any slice - just not "clear"...)
> That was pretty surprising to me.
I wonder if we need to follow the restriction we assumed was only for CCS:
If Software wants to enable Color Compression without Fast clear, Software needs to initialize MCS with zeros.
That restriction goes all the way back to HSW at least (before single sample compression). Is this the only test that fails with fast clears disabled on CMS surfaces?
I spent some time this afternoon poking at this and oh, is it ugly... Here's what I've done so far:
1) I have verified by hand that the slow clear is doing the right thing and setting the correct MCS bits to 0x0 (all on layer 0) which is what we expect for a single-color draw that covers all samples in a pixel.
2) Reverse-engineered the test a bit and determined that the issue is with the first, non-array, multi-sampled image and that it's returning 0 for all of the odd rows of pixels. The values for the first multi-samapled image are vec4s 896-1023 in the buffer_data array in the test for anyone who wants to know. I looked at the results from the two-layer image and they actually look ok.
3) I experimented with unsetting the SurfaceArray bit in the surface state thinking maybe it needed to be unset for single-layer MSAA. It made no difference.
4) I ran it on the simulator, stepped through the shader, and used "view mem" to verify that, when run in simulation, it produces the correct results. Whatever bug we're looking at is not reproduced in simulation.
5) I hacked up the compiler to put the texture coordinate in .yz of the texture result and the MCS value in .w. With this information, I was able to determine that, in the bad pixels, we're getting an MCS value of 0xff instead of 0x0 like we would expect.
6) I hacked up the aux buffer memset to write offset & 0xff into each byte instead of 0xff. Using this, I was able to determine that, for pixel (0, 1), it's fetch from 0xc8 instead of 0x10 like one would expect. As you go across the line, (1, 1) is at 0xc9, (2, 1) is at 0xca, etc. (If you're wondering about high bits, I also tested filling the aux with (offset >> 8) & 0xff and got zeros). Doing a bit more prodding around, it looks very much like a tiling calculation gone wrong. There appears to be a logic to where it puts the pixels but they're way off in the weeds.
7) I experimented with passing more zero parameters to ld_mcs thinking that maybe it was getting a garbage LOD or something. Nothing seemed to make any difference.
8) On a hunch, I started experimenting with different image widths. 4 (the width the test natively uses) is broken as are a smattering of other small widths. It appears like everything works again when the width >= 16 though some smaller widths seemed to work.
Where does that leave us? Well, we could deny MCS on any surface with a 128-bit format and width < 16. However, I have no idea if that's actually the correct limits because I have no idea what's actually going wrong. All I know is that some address calculation in the sampler is ending up in the weeds when it does the MCS fetch.