|Summary:||Register spilling is slow|
|Product:||Mesa||Reporter:||Kevin Rogovin <kevin.rogovin>|
|Component:||Drivers/DRI/i965||Assignee:||Kevin Rogovin <kevin.rogovin>|
|Status:||RESOLVED FIXED||QA Contact:||Intel 3D Bugs Mailing List <intel-3d-bugs>|
|i915 platform:||i915 features:|
30 second compile.
half second compile
common vertex shader
Description Kevin Rogovin 2013-10-18 11:38:32 UTC
Created attachment 87818 [details] 30 second compile. Attached are two fragment shaders which have various noise functions within them. fast_compile.frag.glsl takes Mesa about half a second to compile and link. slow_compile.frag.glsl takes Mesa aboit half a MINUTE to compile and link. The only difference between the two's source code is that the former defines a macro GREY_SCALE_NOISE. With that macro in place, calling the noise function, pnoise(vec4, vec4) happens only oncse. Without the macro it calls it 3 times, with different arguments, for each of red, green and blue.
Comment 1 Kevin Rogovin 2013-10-18 11:39:12 UTC
Created attachment 87819 [details] half second compile
Comment 2 Kevin Rogovin 2013-10-18 11:39:38 UTC
Created attachment 87820 [details] common vertex shader
Comment 3 Petri Latvala 2013-10-21 12:11:52 UTC
Tested with Mesa commit e6c2afa9ceacc188690deba9da7b46bf1a0a9027 (2013-10-13, 10.0.0-devel) Slow compile times reproduced in this version. Profiling gives (pprof) top10 Total: 1537 samples 299 19.5% 19.5% 299 19.5% fs_visitor::virtual_grf_interferes 217 14.1% 33.6% 1132 73.6% fs_visitor::assign_regs 185 12.0% 45.6% 185 12.0% pq_test (inline) 155 10.1% 55.7% 267 17.4% ra_add_node_adjacency 136 8.8% 64.5% 138 9.0% instruction_scheduler::add_dep [clone .part.4] 76 4.9% 69.5% 76 4.9% ra_get_spill_benefit.isra.1 46 3.0% 72.5% 151 9.8% ra_add_node_interference 33 2.1% 74.6% 33 2.1% get_assignment_entry 29 1.9% 76.5% 40 2.6% _int_malloc 24 1.6% 78.1% 100 6.5% __GI___libc_realloc Further: INTEL_DEBUG=perf prints 16-wide shader failed to compile, falling back to 8-wide at a 10-20% performance cost: FS compile failed: Failure to register allocate. Reduce number of live scalar values to avoid this. Fragment shader triggered register spilling. Try reducing the number of live scalar values to improve performance. while compiling the shader, and afterwards (presumably triggered by glUseProgram) 16-wide shader failed to compile, falling back to 8-wide at a 10-20% performance cost: FS compile failed: Failure to register allocate. Reduce number of live scalar values to avoid this. Recompiling fragment shader for program 1 fragment color clamping 1->0 FS compile took 7057.783 ms and stalled the GPU Fragment shader triggered register spilling. Try reducing the number of live scalar values to improve performance.
Comment 4 Eero Tamminen 2013-10-21 14:28:34 UTC
(In reply to comment #3) > (pprof) top10 > Total: 1537 samples For profiling single process doing everything on CPU side (like happens with shader compilation) with low kernel side activity, I would recommend looking at it with Valgrind's callgrind: http://valgrind.org/docs/manual/cl-manual.html If test program doesn't do anything extra, it's just: valgrind --tool=callgrind <program> Sampling profilers can miss things that Valgrind doesn't, and the callgrind tool produces also callgraphs, which are best checked with Kcachegrind (which is included to most distros): http://kcachegrind.sourceforge.net/html/Home.html (Sometimes "--tool=cachegrind" output is also interesting, but I think in this case callgraph is more.)
Comment 5 Eric Anholt 2013-11-22 23:44:23 UTC
Register spilling is exceptionally slow. If register spilling happens in any real-world app, our highest priority is to prevent spilling, to ensure runtime performance. So, we haven't ever optimized the compile performance of actually generating spills. You'd want to update interference graph such that you can rerun the allocator without going through generation of live intervals and the interference graph again.
Comment 6 Kevin Rogovin 2013-11-23 10:57:01 UTC
The register spilling is a product of the fact that the noise functions are inlined without barriers of the inline. If a function was tagged with how many registers it required then those registers could be reused at its next invocation. Instead, in an attempt to optimize via inlining, that barrier is lost and register allocation and freeing is far messier to track. Indeed, that it compiles quickly with one invocation provides evidence that register use tracking is the issue... that and over inlining of functions.
Comment 7 Kenneth Graunke 2013-11-23 21:18:04 UTC
Kevin, The GLSL compiler has no support for non-inlined functions, nor do most (all?) of the driver backends. I have some code in a branch from 2011 that begins to add support for proper functions in the compiler, but I never got to the backend work. It would be a huge amount of work to actually support non-inlined function calls. Eric's point is that our register spilling code is insanely stupid: if we fail to allocate, it tries to spill a single register, and then starts over, having to recompute the live interval analysis and register interference graphs. This is extremely wasteful and stupid. If we fixed that, shaders that spill wouldn't take obscenely long to compile. We just haven't found time to care yet, since real world shaders that spill are very rare. --Ken
Comment 8 Eero Tamminen 2016-08-31 12:10:35 UTC
Ran the 30s shader with frag_runner on SKL GT2. Startup / compile took 2.2s, so this seems much improved. Perf profile looks now like this: 16.33% i965_dri.so fs_visitor::virtual_grf_interferes 13.45% i965_dri.so ra_allocate 11.03% i965_dri.so ra_add_node_adjacency 8.76% i965_dri.so fs_visitor::assign_regs 6.71% i965_dri.so (anon)::ir_copy_propagation_visitor::visit 5.30% i965_dri.so (anon)::ir_copy_propagation_elements_visitor::kill 4.45% i965_dri.so (anon)::ir_copy_propagation_visitor::kill 3.85% i965_dri.so (anon)::ir_copy_propagation_elements_visitor::handle_rvalue 3.60% i965_dri.so decrement_q.isra.2 2.78% i965_dri.so instruction_scheduler::add_dep 2.59% i965_dri.so ra_add_node_interference 2.48% libc-2.23.so _int_malloc 1.90% libc-2.23.so _int_free 1.08% libc-2.23.so __libc_calloc 1.05% libc-2.23.so realloc Profile is similar to ones in bug 91857 and bug 97035, but differs from profiles for bug 93681 and bug 87103.
Comment 9 Tapani Pälli 2019-05-29 06:02:27 UTC
This shader is now 5 seconds to compile+link on HSW machine (and less than sec with shader cache), resolving as fixed.