Some of the dEQP-VK tests spend a lot of time in compiler. In particular these 2 ones are heavily affected: dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct.storage_buffer_sint_geom dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct.storage_buffer_uint_geom Normally with old Mesa take take 2-3 secs to complete but with current git I measured today 38 minutes 43 seconds (!) I've bisected this to following commit: -------------- 8< ----------------------------- commit b031c643491a92a5574c7a4bd659df33f2d89bb6 Author: Ian Romanick <ian.d.romanick@intel.com> Date: Thu Jan 17 17:53:40 2019 -0800 nir: Convert a bcsel with only phi node sources to a phi node v2: Remove the original ALU instruciton after all of its readers are modified to read the new ALU instruction. v3: Fix an issue where a bcsel that may not be executed on a loop iteration due to a break statement is converted to a phi (and therefore incorrectly "executed"). Noticed by Tim. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109216 Fixes: 8fb8ebfbb05 ("intel/compiler: More peephole select") Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
I'm assigning this to Ian as bisected commit is his. The issue was noticed with Android CTS where there is a watchdog limit for tests and it is getting hit currently with many dEQP-VK tests, it's most probably this same issue in many of the cases.
I noticed that this same issue happens with Mesa 19.0.0 tag, however there we don't have b031c643491 yet! So it's possible that there are 2 separate issues or the bisected commit simply reveals something that has been there all along. Anyway, with git HEAD simply skipping opt_simplify_bcsel_of_phi fixes the issue.
(In reply to Tapani Pälli from comment #2) > I noticed that this same issue happens with Mesa 19.0.0 tag, however there > we don't have b031c643491 yet! So it's possible that there are 2 separate > issues or the bisected commit simply reveals something that has been there > all along. Anyway, with git HEAD simply skipping opt_simplify_bcsel_of_phi > fixes the issue. OK, I bisected stable branch and there it hits: ---------- 8< --------------------------------- commit ca36eb12fdfb29c19379c666d24fd8c2d75064cd Author: Dylan Baker <dylan@pnwbakers.com> Date: Mon Feb 11 16:26:37 2019 -0800 Revert "intel/compiler: More peephole select" This reverts commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99.
I think you're getting tricked. 8fb8ebfbb05 ("intel/compiler: More peephole select") introduced a bug that caused many SPIR-V loops to not get unrolled. On the 19.0 branch, we fixed the bug by reverting 8fb8ebfbb05. On master we fixed it by b031c643491. My guess is that those tests are fast when the loops aren't unrolled, and they are slow when the loops are unrolled. I bet they will also be slow at 8fb8ebfbb05~1. I tried a few commits with a -Og debug build and NIR_VALIDATE=0... 18.3: 128.73user 2.96system 2:12.25elapsed 99%CPU (0avgtext+0avgdata 291884maxresident)k ca36eb12fdf: 172.17user 3.45system 2:56.55elapsed 99%CPU (0avgtext+0avgdata 290208maxresident)k 8fb8ebfbb05: 1.84user 0.13system 0:02.13elapsed 92%CPU (0avgtext+0avgdata 269244maxresident)k I wasn't able to reproduce 38:43, but ~2 seconds -> ~2 minutes is a pretty big jump. What result do you get at tip of 18.3?
(In reply to Ian Romanick from comment #4) > I wasn't able to reproduce 38:43, but ~2 seconds -> ~2 minutes is a pretty > big jump. What result do you get at tip of 18.3? I made some new measurements, this time also used NIR_VALIDATE=0 and running the test with 'time' on KBL laptop. Here are results: 18.3.6: real 4m9.892s 19.0.0: real 5m15.113s 19.0.0 with "intel/compiler: More peephole select" patch: real 0m1.842s HEAD: I gave up at: real 30m27.735s HEAD with "opt_simplify_bcsel_of_phi" commented out: real 0m1.859s I also checked Mesa CI time usage, and it seems CI spends a lot of time in particular with these tests, one can observer the differences between test groups here: https://mesa-ci.01.org/tpalli/builds/747/group/e706cc2d56fffd6fe11e94c430476ef0 There is something very specific happening with '8bit_storage.8struct_to_32struct' versus other tests in 'dEQP-VK.spirv_assembly.instruction.graphics' group. I'll continue digging! BTW I tested 19.0.0 tag because that is what Android QA are testing against. They reported also that bringing back 'intel/compiler: More peephole select' (be it buggy or not..) 'fixes' following tests: --- 8< --- dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.8struct_to_32struct#storage_buffer_sint_geom dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.8struct_to_32struct#storage_buffer_uint_geom dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct#storage_buffer_sint_geom dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct#storage_buffer_uint_geom dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct#uniform_sint_geom dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct#uniform_uint_geom dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.uniform_32struct_to_16struct#uniform_buffer_block_geom dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.uniform_32struct_to_16struct#uniform_geom
One more observation is that time is spent especially when shader stage is geometry and type is sint or uint: https://mesa-ci.01.org/tpalli/builds/747/group/61d370971c329819a9dbda43ac5af623#subgroups Other tests seem to execute much faster.
Created attachment 144117 [details] callgraph I'm attaching a callgraph of running the test for couple of hours. It does not tell much but I guess something about the changes done make the register allocator nuts.
(In reply to Ian Romanick from comment #4) > fixed it by b031c643491. My guess is that those tests are fast when the > loops aren't unrolled, and they are slow when the loops are unrolled. Yeah this is correct. With loop unrolling the shader is *huge* with lots and lots of variables used. Maybe we should try to somehow improve heuristics on the unroll cost, it does not make sense to unroll if the end result looks like this :|
The comparison of stats is: 790 instructions, 6 loops, 1127888 cycles. 16:10 spills:fills vs 32077 instructions, 0 loops, 711130 cycles. 926:2863 spills:fills I guess it might be hard to predict though as optimizations might have a big results on the actual content.
Created attachment 144127 [details] [review] force no unroll heuristics hack Here's a patch that refuses to unroll when certain alu ops are met. Not sure how much sense this makes but i seems whenever these nir_op_u2u8 and nir_op_i2i8 are around enough then dEQP issue happens. Perhaps there's a bigger picture/pattern to this so that the heuristic could be improved, I'll take a look at that. Any comments/tips are welcome! :)
Created attachment 144138 [details] [review] slightly refined patch Here's a slightly better attempt. I'll take the bug and try to get this resolved.
There's a far deeper root cause than loop unrolling. tl;dr: Our back-end compiler sucks at handling non-32-bit types. Longer version can be found here: https://gitlab.freedesktop.org/mesa/mesa/merge_requests/973
commit f4ef34f207d15bcade7aed644328035dd0f2cc16 Author: Jason Ekstrand <jason@jlekstrand.net> Date: Wed May 29 17:46:55 2019 -0500 intel/fs: Add an UNDEF instruction to avoid excess live ranges With 8 and 16-bit types and anything where we have to use non-trivial strides registersto deal with restrictions, we end up with things that look like partial writes even though we don't care about any values in the register except those written by that instruction. This is particularly important when dealing with loops because liveness sees is_partial_write and the fact that an old version from a previous loop iteration may be valid at that point and extends all purely partially written values to the entire loop. This commit adds a new UNDEF instruction which does nothing (the generator doesn't emit anything) but which does a fake write to the register. This informs liveness that we don't care about any values before that point so it won't consider those registers to be falsely live. We can safely emit UNDEF instructions for all SSA values that come in from NIR and nearly all temporaries generated by various stages of the compiler. In particular, we need to insert UNDEF instructions when we handle region restrictions because the newly allocated registers are almost guaranteed to be partially written. No shader-db changes. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110432 Reviewed-by: Matt Turner <mattst88@gmail.com>
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.