| Summary: | i965 spilling is slow | | |
|---|---|---|---|
| Product: | Mesa | Reporter: | Clive McCarthy <CliveMcCarthy> |
| Component: | Drivers/DRI/i965 | Assignee: | Ian Romanick <idr> |
| Status: | RESOLVED FIXED | QA Contact: | Intel 3D Bugs Mailing List <intel-3d-bugs> |
| Severity: | normal | | |
| Priority: | medium | CC: | eero.t.tamminen, intel-gfx-bugs, lemody, rhyskidd |
| Version: | unspecified | | |
| Hardware: | x86 (IA32) | | |
| OS: | Linux (All) | | |
| See Also: | https://bugs.freedesktop.org/show_bug.cgi?id=97035, https://bugs.freedesktop.org/show_bug.cgi?id=70613 | | |
| Whiteboard: | | | |
| i915 platform: | | i915 features: | |
| Attachments: | fragment shader, piglit shader_runner shader test | | |
Description
Clive McCarthy
2015-09-02 23:28:41 UTC
Without having the GLSL source it will be very hard to figure out what the key problem would be; there might be something in this very shader which makes things slow. Please attach GLSL source, 88 secs sounds *weird*.

(In reply to Tapani Pälli from comment #2)
> Please attach GLSL source, 88 secs sounds *weird*.

I will attach it later today. It is indeed *weird*, but apparently not weird for Nvidia. The vertex shader is quite trivial but the fragment shader has a moderately large switch statement. Thank you for being interested.

(In reply to Tapani Pälli from comment #2)
> Please attach GLSL source, 88 secs sounds *weird*.

Here's the vertex shader. It will make you laugh.

varying vec2 tex_coord;

void main(void)
{
    tex_coord = gl_MultiTexCoord0.st;
    gl_Position = gl_ProjectionMatrix * gl_ModelViewMatrix * gl_Vertex;
}

Doesn't get simpler. I feel confident the problem isn't caused by the vertex shader.

:) We do lots of optimizations at link time. It's extremely likely that we are hitting a pessimal case in one of the passes with your fragment shader. Can you post it as well?

Created attachment 118071 [details]
fragment shader
Here's my obfuscated shader.
I realized a while ago that there was serious stuff going on in the linker, way beyond what a typical linker does, simply because of some of the messages that were emitted by the linker and not the compiler.
The main function is just a long switch statement. It may be unusual but it's legal and it works.
It is perhaps the only writing in the world that has the words donald_trump and introspection...

Thanks for attaching the shader, will do some profiling on this later.

In case this is a 'real life' case and you are having a problem with this: while this is all 'legal', IMO it would make more sense to write multiple shaders for different purposes and then minimize state changes by grouping the objects per rendering pass (a lot of apps and games do this). This will likely result in better performance on any given platform compared to an 'über-shader' approach.

As for the 'apples and oranges' comparison: on Nvidia you will need to measure without the shader cache. According to the documentation below you should either remove '$HOME/.nv/GLCache' or export __GL_SHADER_DISK_CACHE=false before measuring (don't know if these work because the docs seem a bit old):

http://us.download.nvidia.com/XFree86/Linux-x86/319.32/README/openglenvvariables.html

Then the comparison should be more fair.

(In reply to Tapani Pälli from comment #8)
> Thanks for attaching the shader, will do some profiling on this later.
>
> In case this is a 'real life' case and you are having a problem with this:
> while this is all 'legal', IMO it would make more sense to write multiple
> shaders for different purposes and then minimize state changes by grouping
> the objects per rendering pass (a lot of apps and games do this). This will
> likely result in better performance on any given platform compared to an
> 'über-shader' approach.
>
> As for the 'apples and oranges' comparison: on Nvidia you will need to
> measure without the shader cache. According to the documentation below you
> should either remove '$HOME/.nv/GLCache' or export __GL_SHADER_DISK_CACHE=false
> before measuring (don't know if these work because the docs seem a bit old):
>
> http://us.download.nvidia.com/XFree86/Linux-x86/319.32/README/openglenvvariables.html
>
> Then the comparison should be more fair.

The shader itself is fast enough and works just fine (weird as it might be). The speed issue is with the Mesa linker, running x86 code. My über-shader allows me to try various approaches and experiment very quickly.

The timing of the linker was done with gettimeofday() and represents how long the linker took before any rendering even began. This is not a rendering problem.

Just for grins I did this bananas and pears comparison: I ran my code on an Atom 330 1.6GHz with an Nvidia ION GPU. This was once my main target platform before the Intel NUCs arrived and effectively killed off Nvidia's low-end graphics.

OpenGL version: 2.1.2 NVIDIA 331.38
OpenGL renderer: ION/integrated/SSE2
GLSL version: 1.20 NVIDIA via Cg compiler

GLSL vertex   =    461.687 ms
GLSL fragment =    105.462 ms
GLSL linking  =  14350.100 ms

As you can see, it took the Nvidia linker 14 seconds to link my shader, but an Atom 330 at 1.6GHz isn't much of a processor. An i5-5250U 1.6GHz is taking 59 seconds with Mesa 10.6.3 to link the same shader.

I had forgotten to update the Nvidia driver on the Atom/Ion box, which now shows:

OpenGL version: 3.3.0 NVIDIA 340.76
OpenGL renderer: ION/integrated/SSE2
GLSL version: 3.30 NVIDIA via Cg compiler

GLSL vertex   =   43.074 ms
GLSL fragment =    3.454 ms
GLSL linking  =  120.391 ms

Looks as if Nvidia did some stuff to speed up their linker.

Created attachment 118110 [details]
piglit shader_runner shader test
I put Clive's shader code into the attached shader_runner test.
On a gen8 i7:
Mesa 10.6.4 release build: 44 seconds
Mesa ~master debug build: 267 seconds
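
Clive notes above that his link times were taken with gettimeofday() around the link step, before any rendering. A minimal sketch of that kind of measurement (hypothetical helper name; assumes a current GL context and a program object with compiled vertex and fragment shaders already attached):

#include <stdio.h>
#include <sys/time.h>
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

/* Hypothetical helper: microseconds spent inside glLinkProgram(). */
static long link_time_us(GLuint program)
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    glLinkProgram(program);
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
}

/* Usage, once the shaders are attached:
 *   printf("GLSL linking = %.3f ms\n", link_time_us(prog) / 1000.0);
 */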
Thank you Jordan, reproducibility is very important. I bet some folks were thinking I was crazy. It's possible to boil an egg in 267 seconds.

By bisecting and jumping to some old commits I believe there are multiple small changes (from new features) affecting the total. There are several things taking the run from 30 to ~40 secs, but the biggest change is caused by commit 0ecdf04, which doubles the total time from ~40 secs to ~80 secs.

Some times from my system:

0ecdf04 (83 secs)
1370fde (42 secs)
22ee7ee (30 secs)

The commit causing the biggest hop is:

--- 8< ---
commit 0ecdf04060518149e99a098caf4f6025fd6482a4
Author: Connor Abbott <cwabbott0@gmail.com>
Date:   Thu Jun 25 16:22:26 2015 -0700

    i965/fs: emit constants only once

Clearly, whatever my *weird* fragment shader is doing to the linker shouldn't impede the excellent optimizations that are being made to the actual shader executable. My own work-around on an i3 is to show a fixed image that allows a projector to be focused for the ~90 seconds while the shader links.

Factors of two in the linking of the shader are not an issue. There is a two or three orders of magnitude class of problem somewhere. An i7 far outclasses the Atom processor, but the Nvidia linker gets the job done in 120 ms and the i7 takes 30 to 40 seconds. Profiling the linker while it's working on the shader should show where all the cycles are being consumed. There are an awful lot of i7 cycles in 30 seconds.

I built Ian Romanick's standalone Mesa compiler and use it as an adjunct to my system just to make sure my shaders are compliant and portable. I'm tempted to try to build the Mesa linker too, but I don't know that I have the skills to do the job.

(In reply to Clive McCarthy from comment #15)
> Factors of two in the linking of the shader are not an issue. There is a
> two or three orders of magnitude class of problem somewhere. An i7 far
> outclasses the Atom processor, but the Nvidia linker gets the job done in
> 120 ms and the i7 takes 30 to 40 seconds.

Are you exporting __GL_SHADER_DISK_CACHE=false?

I've seen the time on an nvidia system go from 200 ms to 2.1 s after setting that variable.

Just trying to separate out the effects of the nvidia shader cache...

(In reply to Jordan Justen from comment #16)
> (In reply to Clive McCarthy from comment #15)
> > Factors of two in the linking of the shader are not an issue. There is a
> > two or three orders of magnitude class of problem somewhere. An i7 far
> > outclasses the Atom processor, but the Nvidia linker gets the job done in
> > 120 ms and the i7 takes 30 to 40 seconds.
>
> Are you exporting __GL_SHADER_DISK_CACHE=false?
>
> I've seen the time on an nvidia system go from 200 ms to 2.1 s after
> setting that variable.
>
> Just trying to separate out the effects of the nvidia shader cache...

I'll set that environment variable and report back.

I ran the application as root, which disables the shader cache. The shader linked in 9.8 seconds on an Atom 330 at 1.6GHz, so I take it back: the problem isn't orders of magnitude but something more than a factor of 4. Sorry for not understanding Nvidia's cache strategy.

I re-ran my apples and oranges comparison from my first posting: an Nvidia driver on an i7-4790 platform, with the same GLSL code (but run as root to disable the cache), links in 600 ms. So we maybe have a factor of ~70. Do you guys support glProgramBinary? If so this whole issue goes away.

I see that Mesa 10.6.3 does support glProgramBinary -- I shall start working with it today!!
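
For reference, the glProgramBinary path being discussed (GL_ARB_get_program_binary) looks roughly like the sketch below. Note that a driver can expose the entry points yet advertise zero binary formats, which is exactly the catch described in the next comment. This is a hypothetical, non-Mesa-specific sketch with made-up names:

#include <stdlib.h>
#define GL_GLEXT_PROTOTYPES
#include <GL/gl.h>
#include <GL/glext.h>

/* Sketch: save a linked program's binary and reload it on a later run.
 * 'linked_prog' is assumed to be an already-linked program object and
 * 'fresh_prog' a newly created, empty one. */
static void cache_round_trip(GLuint linked_prog, GLuint fresh_prog)
{
    GLint num_formats = 0;
    glGetIntegerv(GL_NUM_PROGRAM_BINARY_FORMATS, &num_formats);
    if (num_formats == 0)
        return;  /* entry points exist, but no formats: caching impossible */

    GLint len = 0;
    glGetProgramiv(linked_prog, GL_PROGRAM_BINARY_LENGTH, &len);
    void *blob = malloc(len);
    GLenum format = 0;
    glGetProgramBinary(linked_prog, len, NULL, &format, blob);
    /* normally 'format' and 'blob' would be written to disk here */

    /* on a later run, reload instead of compiling and linking again */
    glProgramBinary(fresh_prog, format, blob, len);
    GLint ok = 0;
    glGetProgramiv(fresh_prog, GL_LINK_STATUS, &ok);  /* may fail after a driver update */
    free(blob);
}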
(In reply to Clive McCarthy from comment #21)
> I see that Mesa 10.6.3 does support glProgramBinary -- I shall start working
> with it today!!

But you won't... because we support 0 binary formats. :(

Ian,

Thanks for telling me! I have been banging my head against the screen for an hour puzzling why it all seems good on my workstation but then, after 59 seconds of linking on an i5-5250U, when I try to get the binary it returns zero bytes. Ahhhhhhhhhhhhhhhh!

Is this one of those cunning tricks where someone offers to sell New York steaks for $5/lb except they are always out of stock? The call with GL_ARB_get_program_binary succeeds but now you tell me there are no binary formats? Isn't that somewhat of a wicked tease?

Clive.

"You mean there's a catch?"

"Sure there's a catch", Doc Daneeka replied. "Catch-22. Anyone who wants to get out of combat duty isn't really crazy."

There was only one catch and that was Catch-22, which specified that a concern for one's own safety in the face of dangers that were real and immediate was the process of a rational mind. Orr was crazy and could be grounded. All he had to do was ask; and as soon as he did, he would no longer be crazy and would have to fly more missions. Orr would be crazy to fly more missions and sane if he didn't, but if he was sane, he had to fly them. If he flew them, he was crazy and didn't have to; but if he didn't want to, he was sane and had to. Yossarian was moved very deeply by the absolute simplicity of this clause of Catch-22 and let out a respectful whistle.

(In reply to Jordan Justen from comment #12)
> Mesa 10.6.4 release build: 44 seconds
> Mesa ~master debug build: 267 seconds

Jordan, FWIW, linking in debug builds will be obscenely slow because we call nir_validate_shader() after every optimization/lowering pass. In release builds, that is a no-op... so measuring debug builds isn't terribly meaningful.

I landed a few fixes in master that should help out a little... the front-end time is a lot more reasonable now.

FS register allocation is still horridly slow, however. This is because of our stupid spilling loop. It takes 132 iterations through the RA loop before it succeeds. We can totally fix that, we just haven't made it a priority yet... most shaders don't trip up this pathological case.

(In reply to Ian Romanick from comment #22)
> (In reply to Clive McCarthy from comment #21)
> > I see that Mesa 10.6.3 does support glProgramBinary -- I shall start working
> > with it today!!
>
> But you won't... because we support 0 binary formats. :(

My *weird* shader is not intended to be pathological; further, it seems a poor use of developer time to focus on optimizations when there is basic functionality needed. The effort that might go into dealing with this bug would be better spent getting i965 to handle glProgramBinary in a _useful_ way. It would not only provide a reasonable workaround for the slow linking but would also add a useful feature.

Happened upon this bug while browsing, briefly looked into it. Running that shader_test on my i7-4790:

nouveau/nvc0: 17840 instructions, 0.88 seconds
i965/hsw: 16567 instructions, 72 seconds

Arguably the i965 and nouveau backends are at comparable levels of optimization sophistication, so it seems like i965 can benefit from some improvement/heuristics perhaps.

(In reply to Ilia Mirkin from comment #27)
> Happened upon this bug while browsing, briefly looked into it. Running that
> shader_test on my i7-4790:
>
> nouveau/nvc0: 17840 instructions, 0.88 seconds
> i965/hsw: 16567 instructions, 72 seconds

Yeah, so that pretty much proves that it's the i965 register allocator not handling spilling decently.

(In reply to Kenneth Graunke from comment #28)
> (In reply to Ilia Mirkin from comment #27)
> > Happened upon this bug while browsing, briefly looked into it. Running that
> > shader_test on my i7-4790:
> >
> > nouveau/nvc0: 17840 instructions, 0.88 seconds
> > i965/hsw: 16567 instructions, 72 seconds
>
> Yeah, so that pretty much proves that it's the i965 register allocator not
> handling spilling decently.

Well, running with INTEL_DEBUG=fs there were definitely several spurts of output which were ~seconds apart. So the register allocator might be 65 of those seconds, but I don't think it's the whole thing.

"perf" profiling info for today's Mesa version:

-------------------------------------------------------------------
$ sudo apt-get install linux-tools-common
$ perf record -- frag_runner frag_shader=obsfucated_shader.frag
...
$ perf report -n --sort comm,dso
  Overhead  Samples  Command      Shared Object
    95,18%   218725  frag_runner  i965_dri.so
     4,59%    10964  frag_runner  libc-2.23.so
...
$ perf report
#  Overhead  Object        Symbol
    30.99%  i965_dri.so   fs_visitor::virtual_grf_interferes
    28.39%  i965_dri.so   ra_allocate
    15.80%  i965_dri.so   fs_visitor::assign_regs
     4.47%  i965_dri.so   ra_add_node_adjacency
     1.64%  i965_dri.so   (anonymous namespace)::ir_copy_propagation_visitor::visit
     1.54%  i965_dri.so   decrement_q.isra.2
     1.37%  i965_dri.so   (anonymous namespace)::ir_copy_propagation_visitor::kill
     1.24%  i965_dri.so   ra_add_node_interference
     1.22%  i965_dri.so   brw::fs_live_variables::compute_start_end
     1.11%  libc-2.23.so  _int_malloc
     0.94%  i965_dri.so   (anonymous namespace)::ir_copy_propagation_elements_visitor::kill
     0.75%  libc-2.23.so  _int_free
     0.69%  libc-2.23.so  __libc_calloc
     0.63%  i965_dri.so   (anonymous namespace)::ir_copy_propagation_elements_visitor::handle_rvalue
     0.60%  libc-2.23.so  realloc
     0.45%  i965_dri.so   brw::fs_live_variables::setup_def_use
     0.42%  libc-2.23.so  _int_realloc
     0.40%  i965_dri.so   fs_visitor::choose_spill_reg
     0.36%  i965_dri.so   ra_get_best_spill_node
     0.35%  i965_dri.so   unsafe_free
     0.34%  i965_dri.so   ir_expression::accept
     0.32%  libc-2.23.so  __memset_sse2
     0.27%  libc-2.23.so  malloc_consolidate
     0.26%  i965_dri.so   match_value
     0.25%  libc-2.23.so  __memcpy_sse2
     0.25%  i965_dri.so   fs_visitor::calculate_payload_ranges
     0.25%  i965_dri.so   get_used_mrfs
     0.24%  i965_dri.so   backend_reg::in_range
     0.24%  i965_dri.so   brw::fs_live_variables::compute_live_variables
     0.22%  i965_dri.so   visit_list_elements
     0.20%  i965_dri.so   fs_visitor::spill_reg
     0.13%  libc-2.23.so  __memset_avx2
     0.12%  i965_dri.so   reralloc_array_size
-------------------------------------------------------------------

When using perf report interactively, one can see that:

* virtual_grf_interferes() cost goes to the couple of if checks in that function, so it just gets called too much. Calling is done by assign_regs():

-----------------
for (unsigned i = 0; i < this->alloc.count; i++) {
   ...
   for (unsigned j = 0; j < i; j++) {
      if (virtual_grf_interferes(i, j)) {
         ra_add_node_interference(g, i, j);
      }
   }
-----------------

* All the assign_regs() cost goes to running the above internal loop.

* ra_allocate() cost goes mainly to the inlined graph node loop from ra_simplify(), and slightly to the inlined check from pq_test().
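
A rough sense of scale for that profile: the assign_regs() loop quoted above checks every pair of virtual registers for interference, and the spilling loop repeats register allocation (132 iterations for this shader, per the earlier comment). Purely illustrative arithmetic, with a made-up virtual-register count rather than a number measured from this shader:

/* Illustrative only: n_vgrfs is a hypothetical virtual-GRF count, not a
 * value measured from this shader. Each allocation attempt performs a
 * pairwise interference check, and every spill forces another attempt. */
static unsigned long interference_checks(unsigned long n_vgrfs,
                                         unsigned long ra_iterations)
{
    return ra_iterations * (n_vgrfs * (n_vgrfs - 1) / 2);
}
/* e.g. interference_checks(10000, 132) is about 6.6e9 calls to
 * virtual_grf_interferes(), easily enough to dominate the profile. */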
It seems that an important issue was identified a while back:

commit 0ecdf04060518149e99a098caf4f6025fd6482a4
Author: Connor Abbott <cwabbott0@gmail.com>
Date:   Thu Jun 25 16:22:26 2015 -0700

It is now 2-1/2 years later and the issue isn't yet fixed. On current lower-end Intel processors some 2/3 of the die area is dedicated to the graphics processor. Thus Intel isn't in the CPU business but the GPU business, with a little multi-core x86 CPU tacked on the side. Even the i7-7700HQ I'm using to write this is like that. I once wore an Intel badge. Get your act together guys. Andy Grove's ghost is looking at you.

There have been several fixes in this area since this bug was filed (e.g. early this year), both to reduce spilling and to reduce the impact of spilling, so that the exponential spilling performance cost is an issue only for rare spill-like-crazy shaders. Francisco has WIP patches that should fix most of the remaining issues.

But for now, you can just use Jordan's shader cache branch:
https://cgit.freedesktop.org/~jljusten/mesa/log/?h=i965-shader-cache

(The reason it hasn't gone upstream in the last half a year is that there's a bug with one game that hasn't been found yet.)

I have just (Nov 4th 2017) tested this issue with Mesa 17.4.0-devel. The linkage now takes 46 seconds on an i3 (NUC5i3RYK).

It seems to me that this issue would best be resolved by creating a binary format.

(In reply to Clive McCarthy from comment #33)
> I have just (Nov 4th 2017) tested this issue with Mesa 17.4.0-devel. The
> linkage now takes 46 seconds on an i3 (NUC5i3RYK).
>
> It seems to me that this issue would best be resolved by creating a binary
> format.

We now have "automatic shader cache" support in Mesa master. Please retry using the Mesa master tree with the following environment variable:

export MESA_GLSL_CACHE_DISABLE=0

On the 2nd run you should get a much faster end result. The shader cache will be enabled by default in the near future, after final testing and validation are done.

This shader now compiles and links in ~13 sec on my HSW with current Mesa master (at 2d2e7954c36), and in less than a second after that as the shader cache gets utilized.

Clive, please test with the latest Mesa release and report whether you would agree this issue is resolved by now.

I'll get the latest version from Oibaf and report back.

I ran my code tonight and got:

OpenGL version: 3.0 Mesa 19.2.0-devel (git-af0de6b 2019-05-27 bionic-oibaf-ppa)
OpenGL vendor: Intel Open Source Technology Center
OpenGL renderer: Mesa DRI Intel(R) HD Graphics (Coffeelake 3x8 GT3)

GLSL vertex   :     25,099 us
GLSL fragment :    141,915 us
GLSL linking  :  2,764,600 us
GLSL shader loaded

The vertex shader is a little more complex than the trivial one before, but the fragment shader is the same. The CPU is the faster i7-8559U; however, congratulations on killing this bug.

Note, I measured the time simply by using the 'time' command with Piglit's shader_runner and the attached shader_test file, so my measurement contains a bit more than just compile+link.

Based on comment #37 I'm resolving this bug as fixed. Thanks for the report and testcase!