Created attachment 120988 [details]
slow.shader_test

It seems the glsl compiler isn't that great wrt performance. Please find
attached a reasonably sized shader that takes >1m30s to compile:

time piglit/bin/shader_runner slow.shader_test -auto
PIGLIT: {"result": "pass" }

real    1m47.913s
user    1m47.676s
sys     0m0.239s

Similar bug is 87103
(In reply to Marc-Andre Lureau from comment #0)
> Created attachment 120988 [details]
> slow.shader_test
>
> It seems the glsl compiler isn't that great wrt performance.

Can we please avoid qualitative judgments like this? Especially without
understanding what the problem is, such statements are needlessly pejorative.

> Please find
> attached a reasonably sized shader that takes >1m30s to compile:

I think we have very different definitions of what constitutes reasonably
sized.

>
> time piglit/bin/shader_runner slow.shader_test -auto
> PIGLIT: {"result": "pass" }
>
> real 1m47.913s
> user 1m47.676s
> sys 0m0.239s
>
>
> Similar bug is 87103

The problem (I believe) is that the shader uses a single large array

   vec4 temps[295];

to hold its temporary calculations (even though they are independent values),
but then indirectly addresses that array, forcing the whole array to be kept
alive. Moreover, we generate binary-search if ladders for the indirect
accesses, blowing up the number of instructions and basic blocks.

Keeping 295 vec4s alive of course leads to spilling, which leads to a cascade
of slowdowns ultimately resulting in a massive and disgusting shader:

   15530 instructions. 8 loops. 44798978 cycles. 296:3701 spills:fills. 3248 basic blocks.

It's not clear to me if in all cases it's statically determinable whether only
a subset of the array can be indirectly accessed, as is the case here:

   temps[44].x = float((max( temps[43].xxxx , (vec4(0,0,0,0)))));
   temps[45].x = float((min( temps[44].xxxx , (vec4(5,5,5,5)))));
   temps[46].x = float(intBitsToFloat(ivec4( temps[45].xxxx )));
   addr0 = int(floatBitsToInt(temps[46].xxxx));
   temps[47].x = float(( temps[addr0 + 9].wwww .x));
   ...

but I somewhat doubt it, because of the 8 multiply-nested loops.

Improving our handling of indirect addressing (i.e., not generating the if
ladders) would improve compile time and generated code quality, but I'm not
sure if it would eliminate spilling.

A range analysis pass feeding into an alias analysis might be able to decipher
what's going on with temps[295], but again it's not clear to me from a cursory
glance whether the indirect accesses are statically determinable to occur in a
particular range of the array. If it was, that would drastically improve this
shader's compile time and generated code. Both of those passes are lots of
work.

Since I cannot imagine a human writing such a shader, I have to assume that it
was generated programmatically. Can we improve the program generating the GLSL
shader to create a more transparent shader? Namely, splitting the huge
indirectly-addressed array into independently accessed components.
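[Editorial note] To make the "binary-search if ladder" description above
concrete, here is a minimal, hand-written GLSL sketch of what lowering a
single indirect read of temps[] can expand into. It is an illustration of the
general technique only, not the code Mesa actually emits; the names
(indirect_read, frag_color) and the partially expanded leaves are made up.

#version 330
out vec4 frag_color;

vec4 temps[295];

vec4 indirect_read(int idx)
{
    // The source-level access is simply "return temps[idx];".
    // A binary-search if ladder replaces it with nested branches that
    // narrow the candidate range down to a constant index. Covering 295
    // entries takes roughly 9 branch levels per indirect access, every
    // leaf becomes its own basic block, and all of temps[] must stay live
    // across the ladder, which is what drives the spilling described above.
    if (idx < 148) {
        if (idx < 74) {
            // ... subdivision continues down to single constant indices ...
            return temps[0];     // representative leaf
        }
        return temps[74];        // representative leaf
    }
    if (idx < 222) {
        return temps[148];       // representative leaf
    }
    return temps[221];           // representative leaf (ladder truncated)
}

void main()
{
    int addr0 = floatBitsToInt(temps[46].x);
    frag_color = indirect_read(addr0 + 9);
}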
I read the shader a bit more and have decided that all indirect accesses could be statically determined to occur in limited ranges of the temps[] array.
Created attachment 120995 [details]
new.shader_test

(In reply to Matt Turner from comment #2)
> I read the shader a bit more and have decided that all indirect accesses
> could be statically determined to occur in limited ranges of the temps[]
> array.

With that knowledge in hand, I've modified slow.shader_test to limit indirect
accesses to small arrays -- by making

   vec4 temps_9_14[6];
   vec4 temps_15_20[6];
   vec4 temps_21_25[5];
   vec4 temps_26_30[5];

and modifying the accesses accordingly. That shader (attached with this
comment) compiles in the blink of an eye and is

   712 instructions. 8 loops. 1146286 cycles. 0:0 spills:fills. 115 basic blocks.
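[Editorial note] A hypothetical GLSL sketch of the rewrite applied in
new.shader_test: the indirect access from the snippet quoted in comment #1 is
redirected into the small per-range array. The function name (lookup) and the
index used in main() are made up; the exact code in the attachment is not
reproduced here.

#version 330
out vec4 frag_color;

// One small array per statically determined range, so the register
// allocator only has to keep 5-6 vec4s alive instead of 295.
vec4 temps_9_14[6];     // holds what used to be temps[9]..temps[14]
vec4 temps_15_20[6];    // temps[15]..temps[20]
vec4 temps_21_25[5];    // temps[21]..temps[25]
vec4 temps_26_30[5];    // temps[26]..temps[30]

float lookup(int addr0)
{
    // Before the split this was temps[addr0 + 9].w; the "+ 9" base offset
    // disappears because the split array already starts at old index 9.
    return temps_9_14[addr0].w;
}

void main()
{
    frag_color = vec4(lookup(2));
}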
Created attachment 120996 [details]
const.shader_test

Additionally, I noticed that the data that's in the indirectly addressed
arrays is constant. I only bothered to make the change to temps_15_20[],
temps_21_25[], and temps_26_30[] (temps_9_14[]'s initializers are multiple
statements, but still compile-time evaluable), and that shader generates:

   602 instructions. 8 loops. 1017894 cycles. 0:0 spills:fills. 85 basic blocks.

I suspect giving temps_9_14[] the same treatment would further reduce
instruction counts.

In short, yes, there are things that our GLSL compiler could handle better
(namely, range analysis and alias analysis allowing array splitting, and
indirect handling) but I don't think anyone should be surprised to get garbage
out of a compiler when feeding garbage in. :)
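[Editorial note] A minimal GLSL sketch of the const.shader_test idea, assuming
the data in an indirectly addressed range never changes: declare it as a
compile-time constant array instead of filling it with assignments. The
element values and the index computation below are placeholders, not the
contents of the attachment.

#version 330
out vec4 frag_color;

// Before (sketch): runtime initialisation keeps the array in registers.
//     vec4 temps_15_20[6];
//     temps_15_20[0] = vec4(0.25, 0.0, 0.0, 1.0);
//     ...

// After (sketch): a const initialiser lets the compiler treat the indirect
// lookups as reads of constant data rather than of live register contents.
const vec4 temps_15_20[6] = vec4[6](
    vec4(0.25, 0.0, 0.0, 1.0),
    vec4(0.50, 0.0, 0.0, 1.0),
    vec4(0.75, 0.0, 0.0, 1.0),
    vec4(1.00, 0.0, 0.0, 1.0),
    vec4(1.00, 0.5, 0.0, 1.0),
    vec4(1.00, 1.0, 0.0, 1.0));

void main()
{
    int addr0 = int(gl_FragCoord.x) % 6;   // placeholder index
    frag_color = temps_15_20[addr0];
}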
Also, those are no small tasks, so they would take a while to accomplish on
our end. I strongly advise trying to improve the shaders on the application
side in the meantime; that is likely the quicker route to faster compilation
times.
Thanks a lot for your quick feedback. I didn't mean to be pejorative; I
thought there was something that could improve the situation on the compiler
side, and it seems I was not completely wrong, even if it takes non-trivial
work.

The shader was generated by virgl when visiting
https://www.shadertoy.com/view/MlX3RH. It is problematic since virgl runs in
the qemu mainloop, freezing the VM if something takes too long to complete.
The obvious first thing to do is to move virgl to a separate thread, but this
is non-trivial too :)

With the help of Dave Airlie (who wrote virgl), I'll see if we can address
this in virgl guest-side or tgsi->shader code. Your analysis is very helpful,
thanks again! Hopefully, this bug will help to improve the driver side too.
I've worked around this in the TGSI->GLSL converter so that it allocates
separate temp arrays for the TGSI input ranges. I hadn't considered that the
compiler would fall over on that at all, but it makes sense.

I don't think I can generate the const initialisers, however, since the TGSI
is just a bunch of MOVs; it would take a lot of effort to make sure it was
safe to promote them to const inits. Might look into it another time.
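[Editorial note] A rough GLSL sketch of the idea behind this virgl workaround,
assuming the guest's TGSI declares its temporaries as ranges (e.g. DCL
TEMP[0..8], DCL TEMP[9..14]); the array names and the access in main() are
made up, and virglrenderer's actual output is not quoted here.

#version 330
out vec4 frag_color;

// Previous behaviour (sketch): every TGSI temp landed in one big array,
// so any indirect access kept the whole thing alive.
//     vec4 temp[295];

// Workaround (sketch): one GLSL array per declared TGSI temp range, so an
// indirect access only pins the range it belongs to.
vec4 temp0[9];      // e.g. TGSI TEMP[0..8], only directly addressed
vec4 temp9[6];      // e.g. TGSI TEMP[9..14], indirectly addressed
vec4 temp15[6];     // e.g. TGSI TEMP[15..20], indirectly addressed

// Because the TGSI initialises these ranges with plain MOVs, the converter
// cannot easily prove they are constant, hence no const initialisers here
// (as noted above).

void main()
{
    int addr0 = floatBitsToInt(temp0[3].x);
    // was: temp[addr0 + 9] against the single big array covering all ranges
    frag_color = temp9[addr0];
}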
On Intel, "perf" says the following about the slow shader compilation:

# Cost   Object        Symbol
# ....   ...........   ....................
 20.59%  i965_dri.so   brw::fs_live_variables::compute_start_end
 17.71%  i965_dri.so   ra_allocate
  8.73%  i965_dri.so   brw::fs_live_variables::compute_live_variables
  8.67%  i965_dri.so   fs_visitor::virtual_grf_interferes
  7.53%  i965_dri.so   ra_add_node_adjacency
  4.95%  i965_dri.so   fs_visitor::assign_regs
  2.50%  i965_dri.so   decrement_q.isra.2
  2.47%  i965_dri.so   fs_visitor::opt_copy_propagate
  2.32%  libc-2.23.so  __libc_calloc
  2.00%  i965_dri.so   ra_add_node_interference
  1.94%  libc-2.23.so  _int_malloc
  1.79%  libc-2.23.so  _int_free
  ...
This shader now seems to be causing a crash which is more concerning than the original perf issues.
(In reply to Timothy Arceri from comment #9)
> This shader now seems to be causing a crash which is more concerning than
> the original perf issues.

Bisected the assert to:

commit 45912fb908f7a1d2efbce0f1dbe81e5bc975fbe1
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Fri Dec 9 22:32:08 2016 -0800

    i965/compiler: Use the new nir_opt_copy_prop_vars pass

    We run this after nir_lower_vars_to_ssa so that as many load/store_var
    intrinsics as possible before copy_prop_vars executes.  This is because
    the pass isn't particularly efficient (it does a lot of linear walks of
    a linked list) so we'd like as much of the work as possible to be done
    before copy_prop_vars runs.

    Shader DB results on Sky Lake:

    total instructions in shared programs: 12020290 -> 12013627 (-0.06%)
    instructions in affected programs: 26033 -> 19370 (-25.59%)
    helped: 16
    HURT: 13

    total cycles in shared programs: 137772848 -> 137549012 (-0.16%)
    cycles in affected programs: 6955660 -> 6731824 (-3.22%)
    helped: 217
    HURT: 237

    total loops in shared programs: 3208 -> 3208 (0.00%)
    loops in affected programs: 0 -> 0
    helped: 0
    HURT: 0

    total spills in shared programs: 4112 -> 4057 (-1.34%)
    spills in affected programs: 483 -> 428 (-11.39%)
    helped: 2
    HURT: 0

    total fills in shared programs: 5519 -> 5102 (-7.56%)
    fills in affected programs: 993 -> 576 (-41.99%)
    helped: 2
    HURT: 0

    LOST:   0
    GAINED: 0

    Broadwell had similar results.  On older hardware, the impact isn't as
    large because they don't advertise GL 4.5.

    Of the hurt programs, all but one are hurt by a single instruction and
    the one is hurt by 3 instructions.  All of the helped programs, on the
    other hand, are helped by at least 3 instructions and one kerbal space
    program shader is helped by 44.59%.  The real star of the show, however,
    is the Gl43CSDof synmark2 benchmark which has two shaders which are cut
    by 28% and 40% and the over-all runtime performance of the benchmark on
    my Sky Lake laptop is improved by around 25-30% (it's a bit hard to be
    exact due to thermal throttling).

    Reviewed-by: Timothy Arceri <timothy.arceri@collabora.com>
update, compiler still crashes with this shader
(In reply to Tapani Pälli from comment #11)
> update, compiler still crashes with this shader

What does Valgrind say?
Neither of the attached shaders crashes for me (both in debug and release). I've checked on master and on system's 19.0.5. Tested them on KBL if it matters.
(In reply to Danylo from comment #13)
> Neither of the attached shaders crashes for me (both in debug and release).
> I've checked on master and on system's 19.0.5. Tested them on KBL if it
> matters.

It likely does, I tested with HSW. I can send a trace later.
(In reply to Eero Tamminen from comment #12)
> (In reply to Tapani Pälli from comment #11)
> > update, compiler still crashes with this shader
>
> What does Valgrind say?

There are a couple of invalid reads, will attach a log.
Created attachment 144417 [details]
valgrind output

A bunch of invalid reads; the test passed this time.
It's a clear off-by-one access past the end of the array:

==14419== Invalid read of size 4
==14419==    at 0x6454A39: fs_reg_alloc::set_spill_costs() (brw_fs_reg_allocate.cpp:925)
==14419==    by 0x6454B26: fs_reg_alloc::choose_spill_reg() (brw_fs_reg_allocate.cpp:948)
==14419==    by 0x64556FD: fs_reg_alloc::assign_regs(bool, bool) (brw_fs_reg_allocate.cpp:1163)
==14419==
==14419==  Address 0x83bdcb4 is 0 bytes after a block of size 10,244 alloc'd
==14419==    at 0x483880B: malloc (vg_replace_malloc.c:309)
==14419==    by 0x5CAF04C: ralloc_size (ralloc.c:119)
==14419==    by 0x5CAF3C1: ralloc_array_size (ralloc.c:221)
==14419==    by 0x64317A9: fs_visitor::calculate_live_intervals() (brw_fs_live_variables.cpp:335)
(In reply to Eero Tamminen from comment #17)
> It's a clear off-by-one access past the end of the array:
>
> ==14419== Invalid read of size 4
> ==14419==    at 0x6454A39: fs_reg_alloc::set_spill_costs()
> (brw_fs_reg_allocate.cpp:925)
> ==14419==    by 0x6454B26: fs_reg_alloc::choose_spill_reg()
> (brw_fs_reg_allocate.cpp:948)
> ==14419==    by 0x64556FD: fs_reg_alloc::assign_regs(bool, bool)
> (brw_fs_reg_allocate.cpp:1163)
>
> ==14419==  Address 0x83bdcb4 is 0 bytes after a block of size 10,244 alloc'd
> ==14419==    at 0x483880B: malloc (vg_replace_malloc.c:309)
> ==14419==    by 0x5CAF04C: ralloc_size (ralloc.c:119)
> ==14419==    by 0x5CAF3C1: ralloc_array_size (ralloc.c:221)
> ==14419==    by 0x64317A9: fs_visitor::calculate_live_intervals()
> (brw_fs_live_variables.cpp:335)

Yeah. However, I'm sorry, I had confused my shader_test files (argh!); this
trace was for the shader from the somewhat similar bug #70613. But fortunately
it seems I hit the very same issue with slow.shader_test.

I've bisected this to start happening with the following commit (maybe it
should have a separate bug):

b2d274c6775 intel/fs/ra: Choose a spill reg before throwing away the graph
Created attachment 144421 [details]
valgrind output

Correct valgrind output from running slow.shader_test on HSW.
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been
closed from further activity.

You can subscribe and participate further through the new bug through this
link to our GitLab instance:
https://gitlab.freedesktop.org/mesa/mesa/issues/808.