Bug 93681

Summary: GLSL compilation can be very slow
Product: Mesa Reporter: Marc-Andre Lureau <marcandre.lureau>
Component: glsl-compiler    Assignee: Jason Ekstrand <jason>
Status: RESOLVED MOVED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium CC: danylo.piliaiev, eero.t.tamminen
Version: unspecified   
Hardware: Other   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=110825
Whiteboard:
i915 platform: i915 features:
Attachments: slow.shader_test
new.shader_test
const.shader_test
valgrind output
valgrind output

Description Marc-Andre Lureau 2016-01-12 17:32:03 UTC
Created attachment 120988 [details]
slow.shader_test

It seems the glsl compiler isn't that great wrt performance. Please find attached a reasonably sized shader that takes >1m30s to compile:

time piglit/bin/shader_runner slow.shader_test -auto
PIGLIT: {"result": "pass" }

real	1m47.913s
user	1m47.676s
sys	0m0.239s


Similar bug is 87103
Comment 1 Matt Turner 2016-01-12 19:00:19 UTC
(In reply to Marc-Andre Lureau from comment #0)
> Created attachment 120988 [details]
> slow.shader_test
> 
> It seems the glsl compiler isn't that great wrt performance.

Can we please avoid qualitative judgments like this? Especially without understanding what the problem is, such statements are needlessly pejorative.

> Please find
> attached a reasonably sized shader that takes >1m30s to compile:

I think we have very different definitions of what constitutes reasonably sized.

> 
> time piglit/bin/shader_runner slow.shader_test -auto
> PIGLIT: {"result": "pass" }
> 
> real	1m47.913s
> user	1m47.676s
> sys	0m0.239s
> 
> 
> Similar bug is 87103

The problem (I believe) is that the shader uses a single large array

   vec4 temps[295];

to hold its temporary calculations (even though they are independent values), but then indirectly addresses that array, forcing the whole array to be kept alive. Moreover, we generate binary-search if ladders for the indirect accesses, blowing up the number of instructions and basic blocks. 
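
As an illustration, here is a hypothetical sketch (in GLSL terms, not the driver's actual output) of what a single indirect read of a small 4-element array turns into once it is lowered to a binary-search if ladder:

        // Hypothetical sketch: "x = arr[i]" lowered to a binary-search
        // if ladder over a 4-element array.
        vec4 index_indirect(vec4 arr[4], int i)
        {
            vec4 x;
            if (i < 2) {
                if (i < 1) x = arr[0];
                else       x = arr[1];
            } else {
                if (i < 3) x = arr[2];
                else       x = arr[3];
            }
            return x;
        }

For a 295-element array, every such access expands to a ladder of roughly 2 * 295 - 1 blocks, so instruction and basic-block counts grow with the array size times the number of indirect accesses.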

Keeping 295 vec4s alive of course leads to spilling, which leads to a cascade of slowdowns ultimately resulting in a massive and disgusting shader:

15530 instructions. 8 loops. 44798978 cycles. 296:3701 spills:fills. 3248 basic blocks.


It's not clear to me whether it's statically determinable, in all cases, that only a subset of the array can be indirectly accessed, as is the case here:

        temps[44].x = float((max( temps[43].xxxx , (vec4(0,0,0,0)))));
        temps[45].x = float((min( temps[44].xxxx , (vec4(5,5,5,5)))));
        temps[46].x = float(intBitsToFloat(ivec4( temps[45].xxxx )));
        addr0 = int(floatBitsToInt(temps[46].xxxx));
        temps[47].x = float(( temps[addr0 + 9].wwww .x));

... but I somewhat doubt it, because of the 8 multiply-nested loops.


Improving our handling of indirect addressing (i.e., not generating the if ladders) would improve compile time and generated code quality, but I'm not sure if it would eliminate spilling.

A range analysis pass feeding into an alias analysis might be able to decipher what's going on with temps[295], but again it's not clear to me from a cursory glance whether the indirect accesses are statically determinable to occur in a particular range of the array. If they were, that would drastically improve this shader's compile time and generated code. Both of those passes are lots of work.


Since I cannot imagine a human writing such a shader, I have to assume that it was generated programmatically. Can we improve the program generating the GLSL shader to create a more transparent shader? Namely, by splitting the huge indirectly-addressed array into independently accessed components.
Comment 2 Matt Turner 2016-01-12 23:58:29 UTC
I read the shader a bit more and have decided that all indirect accesses could be statically determined to occur in limited ranges of the temps[] array.
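
For example, the snippet quoted in comment 1 clamps the index before using it. A simplified restatement (hypothetical; the intBitsToFloat/floatBitsToInt round trip is dropped):

        float t = max(temps[43].x, 0.0);   // t >= 0
        t = min(t, 5.0);                   // t <= 5
        int addr0 = int(t);                // so addr0 is in [0, 5]
        temps[47].x = temps[addr0 + 9].w;  // touches only temps[9..14]

Seeing through the max/min pair is enough to prove that this access stays within temps[9..14].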
Comment 3 Matt Turner 2016-01-13 00:29:24 UTC
Created attachment 120995 [details]
new.shader_test

(In reply to Matt Turner from comment #2)
> I read the shader a bit more and have decided that all indirect accesses
> could be statically determined to occur in limited ranges of the temps[]
> array.

With that knowledge in hand, I've modified slow.shader_test to limit indirect accesses to small arrays -- by making

        vec4 temps_9_14[6];
        vec4 temps_15_20[6];
        vec4 temps_21_25[5];
        vec4 temps_26_30[5];

and modifying the accesses accordingly. That shader (attached with this comment) compiles in the blink of an eye and is

712 instructions. 8 loops. 1146286 cycles. 0:0 spills:fills. 115 basic blocks.
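
A sketch of the access rewrite (simplified as above; the attached new.shader_test has the real changes):

        vec4 temps_9_14[6];  // replaces temps[9..14]

        // before: temps[47].x = temps[addr0 + 9].w;   with addr0 in [0, 5]
        // after:
        temps[47].x = temps_9_14[addr0].w;

With six-element arrays, the if ladders shrink from hundreds of blocks to a handful each, and nothing forces 295 vec4s to stay live at once.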
Comment 4 Matt Turner 2016-01-13 00:35:30 UTC
Created attachment 120996 [details]
const.shader_test

Additionally, I noticed that the data that's in the indirectly addressed arrays is constant. I only bothered to make the change to temps_15_20[], temps_21_25[], and temps_26_30[] (temps_9_14[]'s initializers are multiple statements, but still compile-time evaluable), and that shader generates:

602 instructions. 8 loops. 1017894 cycles. 0:0 spills:fills. 85 basic blocks.
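
A sketch of the const conversion, with made-up values (the attached const.shader_test has the real initializers):

        // before: the array is filled element by element at runtime
        vec4 temps_21_25[5];
        temps_21_25[0] = vec4(0.25);
        temps_21_25[1] = vec4(0.50);
        // ...and so on

        // after: a compile-time constant array that the compiler can
        // fold or keep out of the register file entirely
        const vec4 temps_21_25[5] = vec4[5](
            vec4(0.25), vec4(0.50), vec4(0.75), vec4(1.00), vec4(1.25)
        );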

I suspect giving temps_9_14[] the same treatment would further reduce instruction counts.


In short, yes, there are things that our GLSL compiler could handle better (namely, range analysis and alias analysis allowing array splitting, and indirect handling) but I don't think anyone should be surprised to get garbage out of a compiler when feeding garbage in. :)
Comment 5 Kenneth Graunke 2016-01-13 00:41:29 UTC
Also, those are no small tasks, so it would take a while to accomplish on our end. I strongly advise trying to improve the shaders on the application side in the meantime - it's likely to achieve faster compilation times sooner.
Comment 6 Marc-Andre Lureau 2016-01-17 15:36:09 UTC
Thanks a lot for your quick feedback. I didn't mean to be pejorative; I thought there was something that could be improved on the compiler side, and it seems I was not completely wrong, even if it takes non-trivial work.

The shader was generated by virgl when visiting https://www.shadertoy.com/view/MlX3RH. It is problematic since virgl runs in the qemu mainloop, freezing the VM if something takes too long to complete. The obvious first thing to do is to move virgl into a separate thread, but this is non-trivial too :)

With the help of Dave Airlie (who wrote virgl), I'll see if we can address this in virgl guest-side or tgsi->shader code.

Your analysis is very helpful, thanks again!

Hopefully, this bug will help to improve the driver side too.
Comment 7 Dave Airlie 2016-01-18 04:55:50 UTC
I've worked around this in the TGSI->GLSL converter so it allocates separate temp arrays for the TGSI input ranges. I hadn't considered that the compiler would fall over on that at all, but it makes sense.

I don't think I can generate the const initialisers, however, since the TGSI is just a bunch of MOVs; it would take a lot of effort to make sure it was safe to promote them to const inits. Might look into it another time.
Comment 8 Eero Tamminen 2016-08-31 10:24:16 UTC
On Intel, "perf" reports the following for the slow shader compilation:

# Cost  Object       Symbol                                           
# ....  ...........  .................... 
20.59%  i965_dri.so  brw::fs_live_variables::compute_start_end
17.71%  i965_dri.so  ra_allocate
 8.73%  i965_dri.so  brw::fs_live_variables::compute_live_variables
 8.67%  i965_dri.so  fs_visitor::virtual_grf_interferes
 7.53%  i965_dri.so  ra_add_node_adjacency
 4.95%  i965_dri.so  fs_visitor::assign_regs
 2.50%  i965_dri.so  decrement_q.isra.2
 2.47%  i965_dri.so  fs_visitor::opt_copy_propagate
 2.32%  libc-2.23.so __libc_calloc
 2.00%  i965_dri.so  ra_add_node_interference
 1.94%  libc-2.23.so _int_malloc       
 1.79%  libc-2.23.so _int_free
...
Comment 9 Timothy Arceri 2018-04-12 10:50:49 UTC
This shader now seems to be causing a crash which is more concerning than the original perf issues.
Comment 10 Timothy Arceri 2018-04-13 01:37:12 UTC
(In reply to Timothy Arceri from comment #9)
> This shader now seems to be causing a crash which is more concerning than
> the original perf issues.

Bisected the assert to:

commit 45912fb908f7a1d2efbce0f1dbe81e5bc975fbe1
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Fri Dec 9 22:32:08 2016 -0800

    i965/compiler: Use the new nir_opt_copy_prop_vars pass
    
    We run this after nir_lower_vars_to_ssa so that as many load/store_var
    intrinsics as possible are removed before copy_prop_vars executes.  This
    is because the pass isn't particularly efficient (it does a lot of linear
    walks of a linked list), so we'd like as much of the work as possible to
    be done before copy_prop_vars runs.
    
    Shader DB results on Sky Lake:
    
       total instructions in shared programs: 12020290 -> 12013627 (-0.06%)
       instructions in affected programs: 26033 -> 19370 (-25.59%)
       helped: 16
       HURT: 13
    
       total cycles in shared programs: 137772848 -> 137549012 (-0.16%)
       cycles in affected programs: 6955660 -> 6731824 (-3.22%)
       helped: 217
       HURT: 237
    
       total loops in shared programs: 3208 -> 3208 (0.00%)
       loops in affected programs: 0 -> 0
       helped: 0
       HURT: 0
    
       total spills in shared programs: 4112 -> 4057 (-1.34%)
       spills in affected programs: 483 -> 428 (-11.39%)
       helped: 2
       HURT: 0
    
       total fills in shared programs: 5519 -> 5102 (-7.56%)
       fills in affected programs: 993 -> 576 (-41.99%)
       helped: 2
       HURT: 0
    
       LOST:   0
       GAINED: 0
    
    Broadwell had similar results.  On older hardware, the impact isn't as
    large because they don't advertise GL 4.5.  Of the hurt programs, all but
    one are hurt by a single instruction and the one is hurt by 3 instructions.
    All of the helped programs, on the other hand, are helped by at least 3
    instructions, and one Kerbal Space Program shader is helped by 44.59%.
    
    The real star of the show, however, is the Gl43CSDof synmark2 benchmark
    which has two shaders which are cut by 28% and 40% and the over-all runtime
    performance of the benchmark on my Sky Lake laptop is improved by around
    25-30% (it's a bit hard to be exact due to thermal throttling).
    
    Reviewed-by: Timothy Arceri <timothy.arceri@collabora.com>
Comment 11 Tapani Pälli 2019-05-29 06:03:10 UTC
update, compiler still crashes with this shader
Comment 12 Eero Tamminen 2019-05-29 08:03:38 UTC
(In reply to Tapani Pälli from comment #11)
> update, compiler still crashes with this shader

What does Valgrind say?
Comment 13 Danylo 2019-05-29 09:38:54 UTC
Neither of the attached shaders crashes for me (both in debug and release). I've checked on master and on system's 19.0.5. Tested them on KBL if it matters.
Comment 14 Tapani Pälli 2019-06-02 06:13:03 UTC
(In reply to Danylo from comment #13)
> Neither of the attached shaders crashes for me (both in debug and release).
> I've checked on master and on system's 19.0.5. Tested them on KBL if it
> matters.

It likely does; I tested with HSW. I can send a trace later.
Comment 15 Tapani Pälli 2019-06-03 05:16:22 UTC
(In reply to Eero Tamminen from comment #12)
> (In reply to Tapani Pälli from comment #11)
> > update, compiler still crashes with this shader
> 
> What does Valgrind say?

There are a couple of invalid reads; I will attach a log.
Comment 16 Tapani Pälli 2019-06-03 05:17:57 UTC
Created attachment 144417 [details]
valgrind output

A bunch of invalid reads; the test passed this time.
Comment 17 Eero Tamminen 2019-06-03 07:20:05 UTC
It's a clear off-by-one access past the end of an array:

==14419== Invalid read of size 4
==14419==    at 0x6454A39: fs_reg_alloc::set_spill_costs() (brw_fs_reg_allocate.cpp:925)
==14419==    by 0x6454B26: fs_reg_alloc::choose_spill_reg() (brw_fs_reg_allocate.cpp:948)
==14419==    by 0x64556FD: fs_reg_alloc::assign_regs(bool, bool) (brw_fs_reg_allocate.cpp:1163)

==14419==  Address 0x83bdcb4 is 0 bytes after a block of size 10,244 alloc'd
==14419==    at 0x483880B: malloc (vg_replace_malloc.c:309)
==14419==    by 0x5CAF04C: ralloc_size (ralloc.c:119)
==14419==    by 0x5CAF3C1: ralloc_array_size (ralloc.c:221)
==14419==    by 0x64317A9: fs_visitor::calculate_live_intervals() (brw_fs_live_variables.cpp:335)
Comment 18 Tapani Pälli 2019-06-03 10:14:43 UTC
(In reply to Eero Tamminen from comment #17)
> It's a clear off-by-one access past the end of an array:
> 
> ==14419== Invalid read of size 4
> ==14419==    at 0x6454A39: fs_reg_alloc::set_spill_costs()
> (brw_fs_reg_allocate.cpp:925)
> ==14419==    by 0x6454B26: fs_reg_alloc::choose_spill_reg()
> (brw_fs_reg_allocate.cpp:948)
> ==14419==    by 0x64556FD: fs_reg_alloc::assign_regs(bool, bool)
> (brw_fs_reg_allocate.cpp:1163)
> 
> ==14419==  Address 0x83bdcb4 is 0 bytes after a block of size 10,244 alloc'd
> ==14419==    at 0x483880B: malloc (vg_replace_malloc.c:309)
> ==14419==    by 0x5CAF04C: ralloc_size (ralloc.c:119)
> ==14419==    by 0x5CAF3C1: ralloc_array_size (ralloc.c:221)
> ==14419==    by 0x64317A9: fs_visitor::calculate_live_intervals()
> (brw_fs_live_variables.cpp:335)

Yeah. However, I'm sorry, I got confused with my shader_test files (argh!); this trace was for the shader from the somewhat similar bug #70613. Fortunately, it seems I hit the very same issue with slow.shader_test. I've bisected this to start happening with the following commit; maybe we should have a separate bug for that:

b2d274c6775 intel/fs/ra: Choose a spill reg before throwing away the graph
Comment 19 Tapani Pälli 2019-06-03 10:37:08 UTC
Created attachment 144421 [details]
valgrind output

Correct valgrind output from running slow.shader_test on HSW
Comment 20 GitLab Migration User 2019-09-18 19:45:42 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/808.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.