Bug 70613 - Register spilling is slow
Summary: Register spilling is slow
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965 (show other bugs)
Version: 9.1
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: Kevin Rogovin
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-10-18 11:38 UTC by Kevin Rogovin
Modified: 2019-05-29 06:02 UTC (History)
2 users (show)

See Also:
i915 platform:
i915 features:


Attachments
30 second compile. (104.40 KB, text/plain)
2013-10-18 11:38 UTC, Kevin Rogovin
Details
half second compile (104.59 KB, text/plain)
2013-10-18 11:39 UTC, Kevin Rogovin
Details
common vertex shader (1.78 KB, text/plain)
2013-10-18 11:39 UTC, Kevin Rogovin
Details

Description Kevin Rogovin 2013-10-18 11:38:32 UTC
Created attachment 87818 [details]
30 second compile.

Attached are two fragment shaders which have various noise functions within them.

fast_compile.frag.glsl takes Mesa about half a second to compile and link.
slow_compile.frag.glsl takes Mesa aboit half a MINUTE to compile and link.

The only difference between the two's source code is that the former defines a macro GREY_SCALE_NOISE. With that macro in place, calling the noise function, pnoise(vec4, vec4) happens only oncse. Without the macro it calls it 3 times, with different arguments, for each of red, green and blue.
Comment 1 Kevin Rogovin 2013-10-18 11:39:12 UTC
Created attachment 87819 [details]
half second compile
Comment 2 Kevin Rogovin 2013-10-18 11:39:38 UTC
Created attachment 87820 [details]
common vertex shader
Comment 3 Petri Latvala 2013-10-21 12:11:52 UTC
Tested with Mesa commit e6c2afa9ceacc188690deba9da7b46bf1a0a9027 (2013-10-13, 10.0.0-devel)

Slow compile times reproduced in this version. Profiling gives

(pprof) top10
Total: 1537 samples
     299  19.5%  19.5%      299  19.5% fs_visitor::virtual_grf_interferes
     217  14.1%  33.6%     1132  73.6% fs_visitor::assign_regs
     185  12.0%  45.6%      185  12.0% pq_test (inline)
     155  10.1%  55.7%      267  17.4% ra_add_node_adjacency
     136   8.8%  64.5%      138   9.0% instruction_scheduler::add_dep [clone .part.4]
      76   4.9%  69.5%       76   4.9% ra_get_spill_benefit.isra.1
      46   3.0%  72.5%      151   9.8% ra_add_node_interference
      33   2.1%  74.6%       33   2.1% get_assignment_entry
      29   1.9%  76.5%       40   2.6% _int_malloc
      24   1.6%  78.1%      100   6.5% __GI___libc_realloc

Further:

INTEL_DEBUG=perf prints

16-wide shader failed to compile, falling back to 8-wide at a 10-20% performance cost: FS compile failed: Failure to register allocate.  Reduce number of live scalar values to avoid this.
Fragment shader triggered register spilling.  Try reducing the number of live scalar values to improve performance.

while compiling the shader, and afterwards (presumably triggered by glUseProgram)

16-wide shader failed to compile, falling back to 8-wide at a 10-20% performance cost: FS compile failed: Failure to register allocate.  Reduce number of live scalar values to avoid this.
Recompiling fragment shader for program 1
  fragment color clamping 1->0
FS compile took 7057.783 ms and stalled the GPU
Fragment shader triggered register spilling.  Try reducing the number of live scalar values to improve performance.
Comment 4 Eero Tamminen 2013-10-21 14:28:34 UTC
(In reply to comment #3)
> (pprof) top10
> Total: 1537 samples

For profiling single process doing everything on CPU side (like happens with shader compilation) with low kernel side activity, I would recommend looking at it with Valgrind's callgrind:
  http://valgrind.org/docs/manual/cl-manual.html

If test program doesn't do anything extra, it's just:
  valgrind --tool=callgrind <program>

Sampling profilers can miss things that Valgrind doesn't, and the callgrind tool produces also callgraphs, which are best checked with Kcachegrind (which is included to most distros):
  http://kcachegrind.sourceforge.net/html/Home.html

(Sometimes "--tool=cachegrind" output is also interesting, but I think in this case callgraph is more.)
Comment 5 Eric Anholt 2013-11-22 23:44:23 UTC
Register spilling is exceptionally slow.  If register spilling happens in any real-world app, our highest priority is to prevent spilling, to ensure runtime performance.  So, we haven't ever optimized the compile performance of actually generating spills.

You'd want to update interference graph such that you can rerun the allocator without going through generation of live intervals and the interference graph again.
Comment 6 Kevin Rogovin 2013-11-23 10:57:01 UTC
The register spilling is a product of the fact that the noise functions are inlined without barriers of the inline. If a function was tagged with how many registers it required then those registers could be reused at its next invocation. Instead, in an attempt to optimize via inlining, that barrier is lost and register allocation and freeing is far messier to track. Indeed, that it compiles quickly with one invocation provides evidence that register use tracking is the issue... that and over inlining of functions.
Comment 7 Kenneth Graunke 2013-11-23 21:18:04 UTC
Kevin,

The GLSL compiler has no support for non-inlined functions, nor do most (all?) of the driver backends.  I have some code in a branch from 2011 that begins to add support for proper functions in the compiler, but I never got to the backend work.

It would be a huge amount of work to actually support non-inlined function calls.

Eric's point is that our register spilling code is insanely stupid: if we fail to allocate, it tries to spill a single register, and then starts over, having to recompute the live interval analysis and register interference graphs.  This is extremely wasteful and stupid.  If we fixed that, shaders that spill wouldn't take obscenely long to compile.

We just haven't found time to care yet, since real world shaders that spill are very rare.

--Ken
Comment 8 Eero Tamminen 2016-08-31 12:10:35 UTC
Ran the 30s shader with frag_runner on SKL GT2.  Startup / compile took 2.2s, so this seems much improved.

Perf profile looks now like this:
16.33%  i965_dri.so  fs_visitor::virtual_grf_interferes
13.45%  i965_dri.so  ra_allocate                   
11.03%  i965_dri.so  ra_add_node_adjacency
 8.76%  i965_dri.so  fs_visitor::assign_regs                                
 6.71%  i965_dri.so  (anon)::ir_copy_propagation_visitor::visit
 5.30%  i965_dri.so  (anon)::ir_copy_propagation_elements_visitor::kill
 4.45%  i965_dri.so  (anon)::ir_copy_propagation_visitor::kill
 3.85%  i965_dri.so  (anon)::ir_copy_propagation_elements_visitor::handle_rvalue
 3.60%  i965_dri.so  decrement_q.isra.2
 2.78%  i965_dri.so  instruction_scheduler::add_dep
 2.59%  i965_dri.so  ra_add_node_interference
 2.48%  libc-2.23.so _int_malloc
 1.90%  libc-2.23.so _int_free      
 1.08%  libc-2.23.so __libc_calloc      
 1.05%  libc-2.23.so realloc

Profile is similar to ones in bug 91857 and bug 97035, but differs from profiles for bug 93681 and bug 87103.
Comment 9 Tapani Pälli 2019-05-29 06:02:27 UTC
This shader is now 5 seconds to compile+link on HSW machine (and less than sec with shader cache), resolving as fixed.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.