Created attachment 130103 [details] Offending SPIR-V Trying to compile a large vulkan compute shader (15Kb of SPIR-V) on Gen9 blows up to ~150Kb binary, causing an assert in anv_allocator.c:637 because it is larger than the maximum state size. The offending SPIR-V is attached and I will try to attach the GLSL that generated it once the bug is created (I seem to only be able to attach on file at creation time) We are compiling the GLSL to SPIR-V using GLSLang, and the best modulator we can find for the problem is to make "depths" on line 31 a shared array like "roi" on line 28. This reduces the final binary size from 150 Kb to 7 Kb and vastly reduces compile time. Let us know if you have any questions.
Created attachment 130104 [details] GLSL Source
Thanks for the bug report! This is a known issue and we keep bumping the maximum but apparently not far enough. I'll try and look into the overflow issue some time this week. As far as making stuff shared goes, that's not a bad idea given that you are already doing so for roi. In this case, using shared memory is basically just telling the compiler "please put this in memory, not registers" which is what you want the compiler to do in this case. Our compiler tries to fairly aggressively (too aggressively in some cases) put things in registers and this is a case where you really want it to spill. It's interesting that you found a perf cliff on NVIDIA as well. Does it help perf to put depths in shared on NVIDIA? In any case, the fact that that array explodes your shader is also somewhat our fault because this is definitely something our compiler could be doing for you and I've got plans to do just that but it hasn't happened yet.
Thanks Jason! Just out of curiosity does this mean that we should generally avoid large local arrays in compute shaders? I asked the guy that wrote that shader and he said that making depths shared on NVIDIA made it 2x slower. If you want more info I can put you in contact with him directly. Thanks again, Forrest
Thanks Jason! In general on NVIDIA, it seems faster to store values in registers than shared memory, *unless* there is not enough register space and it needs to spill the data elsewhere (to main GPU memory? to some other cache? I have no idea). In this case, it seems that making roi shared reduces the register pressure enough that depth can fit in registers; as a result, making depth shared reduces performance. At least, that is my interpretation.
I'm not seeing this assertion when I run the shader. Maybe your build of the driver is old? At one point, we bumped the maximum for programs to 1 MB which should be plenty big. Can you try master?
Hi Jason, Can you point us at the change that ups the max size so we don't have to try to satisfy the build dependencies of master branch? Will it practical to cherry pick this change? Forrest
commit ff0dd67d2f9fc7cc240da732153ac4fac4e8d04d Author: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Date: Tue Jan 10 12:44:32 2017 +0100 anv: increase ANV_MAX_STATE_SIZE_LOG2 limit to 1 MB Fixes crash in dEQP-VK.ubo.random.all_shared_buffer.48 due to a fragment shader code bigger than 128 kB. This patch increases the allocation size limit to 1 MB. v2: - Increase it to 1 MB (Jason) - Increase device->instruction_block_pool allocation size in anv_device.c (Jason) Signed-off-by: Samuel Iglesias Gonsálvez <siglesias@igalia.com> Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
Awesome thanks! We'll give this a try as soon as possible. Cheers, Forrest
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.