Some of the dEQP-VK tests spend a lot of time in compiler. In particular these 2 ones are heavily affected:
Normally with old Mesa take take 2-3 secs to complete but with current git I measured today 38 minutes 43 seconds (!)
I've bisected this to following commit:
-------------- 8< -----------------------------
Author: Ian Romanick <firstname.lastname@example.org>
Date: Thu Jan 17 17:53:40 2019 -0800
nir: Convert a bcsel with only phi node sources to a phi node
v2: Remove the original ALU instruciton after all of its readers are
modified to read the new ALU instruction.
v3: Fix an issue where a bcsel that may not be executed on a loop
iteration due to a break statement is converted to a phi (and therefore
incorrectly "executed"). Noticed by Tim.
Fixes: 8fb8ebfbb05 ("intel/compiler: More peephole select")
Reviewed-by: Timothy Arceri <email@example.com>
I'm assigning this to Ian as bisected commit is his. The issue was noticed with Android CTS where there is a watchdog limit for tests and it is getting hit currently with many dEQP-VK tests, it's most probably this same issue in many of the cases.
I noticed that this same issue happens with Mesa 19.0.0 tag, however there we don't have b031c643491 yet! So it's possible that there are 2 separate issues or the bisected commit simply reveals something that has been there all along. Anyway, with git HEAD simply skipping opt_simplify_bcsel_of_phi fixes the issue.
(In reply to Tapani Pälli from comment #2)
> I noticed that this same issue happens with Mesa 19.0.0 tag, however there
> we don't have b031c643491 yet! So it's possible that there are 2 separate
> issues or the bisected commit simply reveals something that has been there
> all along. Anyway, with git HEAD simply skipping opt_simplify_bcsel_of_phi
> fixes the issue.
OK, I bisected stable branch and there it hits:
---------- 8< ---------------------------------
Author: Dylan Baker <firstname.lastname@example.org>
Date: Mon Feb 11 16:26:37 2019 -0800
Revert "intel/compiler: More peephole select"
This reverts commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99.
I think you're getting tricked. 8fb8ebfbb05 ("intel/compiler: More peephole select") introduced a bug that caused many SPIR-V loops to not get unrolled. On the 19.0 branch, we fixed the bug by reverting 8fb8ebfbb05. On master we fixed it by b031c643491. My guess is that those tests are fast when the loops aren't unrolled, and they are slow when the loops are unrolled. I bet they will also be slow at 8fb8ebfbb05~1.
I tried a few commits with a -Og debug build and NIR_VALIDATE=0...
18.3: 128.73user 2.96system 2:12.25elapsed 99%CPU (0avgtext+0avgdata 291884maxresident)k
ca36eb12fdf: 172.17user 3.45system 2:56.55elapsed 99%CPU (0avgtext+0avgdata 290208maxresident)k
8fb8ebfbb05: 1.84user 0.13system 0:02.13elapsed 92%CPU (0avgtext+0avgdata 269244maxresident)k
I wasn't able to reproduce 38:43, but ~2 seconds -> ~2 minutes is a pretty big jump. What result do you get at tip of 18.3?
(In reply to Ian Romanick from comment #4)
> I wasn't able to reproduce 38:43, but ~2 seconds -> ~2 minutes is a pretty
> big jump. What result do you get at tip of 18.3?
I made some new measurements, this time also used NIR_VALIDATE=0 and running the test with 'time' on KBL laptop. Here are results:
18.3.6: real 4m9.892s
19.0.0: real 5m15.113s
19.0.0 with "intel/compiler: More peephole select" patch: real 0m1.842s
HEAD: I gave up at: real 30m27.735s
HEAD with "opt_simplify_bcsel_of_phi" commented out: real 0m1.859s
I also checked Mesa CI time usage, and it seems CI spends a lot of time in particular with these tests, one can observer the differences between test groups here:
There is something very specific happening with '8bit_storage.8struct_to_32struct' versus other tests in 'dEQP-VK.spirv_assembly.instruction.graphics' group. I'll continue digging!
BTW I tested 19.0.0 tag because that is what Android QA are testing against. They reported also that bringing back 'intel/compiler: More peephole select' (be it buggy or not..) 'fixes' following tests:
--- 8< ---
One more observation is that time is spent especially when shader stage is geometry and type is sint or uint:
Other tests seem to execute much faster.
Created attachment 144117 [details]
I'm attaching a callgraph of running the test for couple of hours. It does not tell much but I guess something about the changes done make the register allocator nuts.
(In reply to Ian Romanick from comment #4)
> fixed it by b031c643491. My guess is that those tests are fast when the
> loops aren't unrolled, and they are slow when the loops are unrolled.
Yeah this is correct. With loop unrolling the shader is *huge* with lots and lots of variables used. Maybe we should try to somehow improve heuristics on the unroll cost, it does not make sense to unroll if the end result looks like this :|
The comparison of stats is:
790 instructions, 6 loops, 1127888 cycles. 16:10 spills:fills
32077 instructions, 0 loops, 711130 cycles. 926:2863 spills:fills
I guess it might be hard to predict though as optimizations might have a big results on the actual content.
Created attachment 144127 [details] [review]
force no unroll heuristics hack
Here's a patch that refuses to unroll when certain alu ops are met. Not sure how much sense this makes but i seems whenever these nir_op_u2u8 and nir_op_i2i8 are around enough then dEQP issue happens. Perhaps there's a bigger picture/pattern to this so that the heuristic could be improved, I'll take a look at that. Any comments/tips are welcome! :)
Created attachment 144138 [details] [review]
slightly refined patch
Here's a slightly better attempt. I'll take the bug and try to get this resolved.
There's a far deeper root cause than loop unrolling. tl;dr: Our back-end compiler sucks at handling non-32-bit types. Longer version can be found here:
Author: Jason Ekstrand <email@example.com>
Date: Wed May 29 17:46:55 2019 -0500
intel/fs: Add an UNDEF instruction to avoid excess live ranges
With 8 and 16-bit types and anything where we have to use non-trivial
strides registersto deal with restrictions, we end up with things that
look like partial writes even though we don't care about any values in
the register except those written by that instruction. This is
particularly important when dealing with loops because liveness sees
is_partial_write and the fact that an old version from a previous loop
iteration may be valid at that point and extends all purely partially
written values to the entire loop.
This commit adds a new UNDEF instruction which does nothing (the
generator doesn't emit anything) but which does a fake write to the
register. This informs liveness that we don't care about any values
before that point so it won't consider those registers to be falsely
live. We can safely emit UNDEF instructions for all SSA values that
come in from NIR and nearly all temporaries generated by various stages
of the compiler. In particular, we need to insert UNDEF instructions
when we handle region restrictions because the newly allocated registers
are almost guaranteed to be partially written.
No shader-db changes.
Reviewed-by: Matt Turner <firstname.lastname@example.org>