Bug 110432

Summary:	dEQP-VK tests running very long time
Product:	Mesa	Reporter:	Tapani Pälli <lemody>
Component:	Drivers/DRI/i965	Assignee:	Tapani Pälli <lemody>
Status:	RESOLVED FIXED	QA Contact:	Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity:	normal
Priority:	medium	CC:	caio.oliveira
Version:	git
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:		i915 features:
Attachments:	callgraph force no unroll heuristics hack slightly refined patch

Description Tapani Pälli 2019-04-15 09:07:21 UTC

Some of the dEQP-VK tests spend a lot of time in compiler. In particular these 2 ones are heavily affected:

dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct.storage_buffer_sint_geom
dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct.storage_buffer_uint_geom

Normally with old Mesa take take 2-3 secs to complete but with current git I measured today 38 minutes 43 seconds (!)


I've bisected this to following commit:

-------------- 8< -----------------------------
commit b031c643491a92a5574c7a4bd659df33f2d89bb6
Author: Ian Romanick <ian.d.romanick@intel.com>
Date:   Thu Jan 17 17:53:40 2019 -0800

    nir: Convert a bcsel with only phi node sources to a phi node
    
    v2: Remove the original ALU instruciton after all of its readers are
    modified to read the new ALU instruction.
    
    v3: Fix an issue where a bcsel that may not be executed on a loop
    iteration due to a break statement is converted to a phi (and therefore
    incorrectly "executed").  Noticed by Tim.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109216
    Fixes: 8fb8ebfbb05 ("intel/compiler: More peephole select")
    Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>

Comment 1 Tapani Pälli 2019-04-15 09:08:51 UTC

I'm assigning this to Ian as bisected commit is his. The issue was noticed with Android CTS where there is a watchdog limit for tests and it is getting hit currently with many dEQP-VK tests, it's most probably this same issue in many of the cases.

Comment 2 Tapani Pälli 2019-04-15 11:22:20 UTC

I noticed that this same issue happens with Mesa 19.0.0 tag, however there we don't have b031c643491 yet! So it's possible that there are 2 separate issues or the bisected commit simply reveals something that has been there all along. Anyway, with git HEAD simply skipping opt_simplify_bcsel_of_phi fixes the issue.

Comment 3 Tapani Pälli 2019-04-15 11:35:47 UTC

(In reply to Tapani Pälli from comment #2)
> I noticed that this same issue happens with Mesa 19.0.0 tag, however there
> we don't have b031c643491 yet! So it's possible that there are 2 separate
> issues or the bisected commit simply reveals something that has been there
> all along. Anyway, with git HEAD simply skipping opt_simplify_bcsel_of_phi
> fixes the issue.

OK, I bisected stable branch and there it hits:
---------- 8< ---------------------------------
commit ca36eb12fdfb29c19379c666d24fd8c2d75064cd
Author: Dylan Baker <dylan@pnwbakers.com>
Date:   Mon Feb 11 16:26:37 2019 -0800

    Revert "intel/compiler: More peephole select"
    
    This reverts commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99.

Comment 4 Ian Romanick 2019-04-17 00:51:29 UTC

I think you're getting tricked.  8fb8ebfbb05 ("intel/compiler: More peephole select") introduced a bug that caused many SPIR-V loops to not get unrolled.  On the 19.0 branch, we fixed the bug by reverting 8fb8ebfbb05.  On master we fixed it by b031c643491.  My guess is that those tests are fast when the loops aren't unrolled, and they are slow when the loops are unrolled.  I bet they will also be slow at 8fb8ebfbb05~1.

I tried a few commits with a -Og debug build and NIR_VALIDATE=0...

18.3:        128.73user 2.96system 2:12.25elapsed 99%CPU (0avgtext+0avgdata 291884maxresident)k
ca36eb12fdf: 172.17user 3.45system 2:56.55elapsed 99%CPU (0avgtext+0avgdata 290208maxresident)k
8fb8ebfbb05:   1.84user 0.13system 0:02.13elapsed 92%CPU (0avgtext+0avgdata 269244maxresident)k

I wasn't able to reproduce 38:43, but ~2 seconds -> ~2 minutes is a pretty big jump.  What result do you get at tip of 18.3?

Comment 5 Tapani Pälli 2019-04-17 06:02:39 UTC

(In reply to Ian Romanick from comment #4)
> I wasn't able to reproduce 38:43, but ~2 seconds -> ~2 minutes is a pretty
> big jump.  What result do you get at tip of 18.3?

I made some new measurements, this time also used NIR_VALIDATE=0 and running the test with 'time' on KBL laptop. Here are results:

18.3.6: real 4m9.892s
19.0.0: real 5m15.113s
19.0.0 with "intel/compiler: More peephole select" patch: real 0m1.842s
HEAD: I gave up at: real 30m27.735s
HEAD with "opt_simplify_bcsel_of_phi" commented out: real 0m1.859s

I also checked Mesa CI time usage, and it seems CI spends a lot of time in particular with these tests, one can observer the differences between test groups here:

https://mesa-ci.01.org/tpalli/builds/747/group/e706cc2d56fffd6fe11e94c430476ef0

There is something very specific happening with '8bit_storage.8struct_to_32struct' versus other tests in 'dEQP-VK.spirv_assembly.instruction.graphics' group. I'll continue digging!

BTW I tested 19.0.0 tag because that is what Android QA are testing against. They reported also that bringing back 'intel/compiler: More peephole select' (be it buggy or not..) 'fixes' following tests:

--- 8< ---
dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.8struct_to_32struct#storage_buffer_sint_geom
	dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.8struct_to_32struct#storage_buffer_uint_geom
	dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct#storage_buffer_sint_geom
	dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct#storage_buffer_uint_geom
	dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct#uniform_sint_geom
	dEQP-VK.spirv_assembly.instruction.graphics.8bit_storage.32struct_to_8struct#uniform_uint_geom
	dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.uniform_32struct_to_16struct#uniform_buffer_block_geom
	dEQP-VK.spirv_assembly.instruction.graphics.16bit_storage.uniform_32struct_to_16struct#uniform_geom

Comment 6 Tapani Pälli 2019-04-17 06:09:51 UTC

One more observation is that time is spent especially when shader stage is geometry and type is sint or uint:

https://mesa-ci.01.org/tpalli/builds/747/group/61d370971c329819a9dbda43ac5af623#subgroups

Other tests seem to execute much faster.

Comment 7 Tapani Pälli 2019-04-30 09:46:32 UTC

Created attachment 144117 [details]
callgraph

I'm attaching a callgraph of running the test for couple of hours. It does not tell much but I guess something about the changes done make the register allocator nuts.

Comment 8 Tapani Pälli 2019-04-30 10:45:46 UTC

(In reply to Ian Romanick from comment #4)
> fixed it by b031c643491.  My guess is that those tests are fast when the
> loops aren't unrolled, and they are slow when the loops are unrolled.

Yeah this is correct. With loop unrolling the shader is *huge* with lots and lots of variables used. Maybe we should try to somehow improve heuristics on the unroll cost, it does not make sense to unroll if the end result looks like this :|

Comment 9 Tapani Pälli 2019-04-30 11:01:42 UTC

The comparison of stats is:

790 instructions, 6 loops, 1127888 cycles. 16:10 spills:fills
vs
32077 instructions, 0 loops, 711130 cycles. 926:2863 spills:fills

I guess it might be hard to predict though as optimizations might have a big results on the actual content.

Comment 10 Tapani Pälli 2019-05-02 10:54:12 UTC

Created attachment 144127 [details] [review]
force no unroll heuristics hack

Here's a patch that refuses to unroll when certain alu ops are met. Not sure how much sense this makes but i seems whenever these nir_op_u2u8 and nir_op_i2i8 are around enough then dEQP issue happens. Perhaps there's a bigger picture/pattern to this so that the heuristic could be improved, I'll take a look at that. Any comments/tips are welcome! :)

Comment 11 Tapani Pälli 2019-05-03 07:43:46 UTC

Created attachment 144138 [details] [review]
slightly refined patch

Here's a slightly better attempt. I'll take the bug and try to get this resolved.

Comment 12 Jason Ekstrand 2019-05-29 22:56:27 UTC

There's a far deeper root cause than loop unrolling.  tl;dr: Our back-end compiler sucks at handling non-32-bit types.  Longer version can be found here:

https://gitlab.freedesktop.org/mesa/mesa/merge_requests/973

Comment 13 Tapani Pälli 2019-06-06 07:08:54 UTC

commit f4ef34f207d15bcade7aed644328035dd0f2cc16
Author: Jason Ekstrand <jason@jlekstrand.net>
Date:   Wed May 29 17:46:55 2019 -0500

    intel/fs: Add an UNDEF instruction to avoid excess live ranges
    
    With 8 and 16-bit types and anything where we have to use non-trivial
    strides registersto deal with restrictions, we end up with things that
    look like partial writes even though we don't care about any values in
    the register except those written by that instruction.  This is
    particularly important when dealing with loops because liveness sees
    is_partial_write and the fact that an old version from a previous loop
    iteration may be valid at that point and extends all purely partially
    written values to the entire loop.
    
    This commit adds a new UNDEF instruction which does nothing (the
    generator doesn't emit anything) but which does a fake write to the
    register.  This informs liveness that we don't care about any values
    before that point so it won't consider those registers to be falsely
    live.  We can safely emit UNDEF instructions for all SSA values that
    come in from NIR and nearly all temporaries generated by various stages
    of the compiler.  In particular, we need to insert UNDEF instructions
    when we handle region restrictions because the newly allocated registers
    are almost guaranteed to be partially written.
    
    No shader-db changes.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110432
    Reviewed-by: Matt Turner <mattst88@gmail.com>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.