Bug 91857 - i965 spilling is slow
Summary: i965 spilling is slow
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Ian Romanick
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-09-02 23:28 UTC by Clive McCarthy
Modified: 2019-05-28 08:07 UTC
CC: 4 users

See Also:
i915 platform:
i915 features:


Attachments
fragment shader (24.29 KB, text/plain)
2015-09-03 18:12 UTC, Clive McCarthy
piglit shader_runner shader test (24.53 KB, text/plain)
2015-09-06 23:23 UTC, Jordan Justen

Description Clive McCarthy 2015-09-02 23:28:41 UTC
The Mesa linker is dreadfully slow. Compilation of my GLSL source on an i3-4010U takes just 63.9ms. The linker takes 88.2 seconds, yes, seconds.

An apples and oranges comparison: an Nvidia driver on an i7 platform, with the same GLSL code, compiles in 31.1ms and links in 12.5ms.

Proportionately (comparing each linker's time to its own compile time, which roughly factors out the CPU difference), this makes the Mesa linker about 3,400 times slower than the Nvidia linker, apples-to-apples. My guess is that some linkage look-up has bad behaviour as the symbol table size grows.

The good news is that the shader runs just fine.
Comment 1 Tapani Pälli 2015-09-03 15:15:38 UTC
Without the GLSL source it will be very hard to figure out the key problem; there might be something in this particular shader that makes things slow.
Comment 2 Tapani Pälli 2015-09-03 15:24:11 UTC
Please attach the GLSL source; 88 secs sounds *weird*.
Comment 3 Clive McCarthy 2015-09-03 15:52:08 UTC
(In reply to Tapani Pälli from comment #2)
> Please attach the GLSL source; 88 secs sounds *weird*.

I will attach it later today. It is indeed *weird* but apparently not weird for Nvidia. The vertex shader is quite trivial but the fragment shader has a moderately large switch statement.

Thank you for being interested.
Comment 4 Clive McCarthy 2015-09-03 16:10:15 UTC
(In reply to Tapani Pälli from comment #2)
> Please attach the GLSL source; 88 secs sounds *weird*.

Here's the vertex shader. It will make you laugh.

varying vec2 tex_coord;
void main(void)
{
    tex_coord    = gl_MultiTexCoord0.st;
    gl_Position  = gl_ProjectionMatrix * gl_ModelViewMatrix * gl_Vertex;
}

Doesn't get simpler.
Comment 5 Matt Turner 2015-09-03 17:31:41 UTC
I feel confident the problem isn't caused by the vertex shader. :)

We do lots of optimizations at link time. It's extremely likely that we are hitting a pessimal case in one of the passes with your fragment shader. Can you post it as well?
Comment 6 Clive McCarthy 2015-09-03 18:12:12 UTC
Created attachment 118071 [details]
fragment shader

Here's my obfuscated shader.

I realized a while ago that there was serious stuff going on in the linker, way beyond what a typical linker does, simply because some of the messages were emitted by the linker and not the compiler.

The main function is just a long switch statement. It may be unusual but it's legal and it works.
Comment 7 Clive McCarthy 2015-09-03 18:27:42 UTC
It is perhaps the only writing in the world that has the words donald_trump and introspection...
Comment 8 Tapani Pälli 2015-09-04 10:26:28 UTC
Thanks for attaching the shader, will do some profiling on this later.

In case this is a 'real life' case that is causing you problems: while it is all 'legal', IMO it would make more sense to write multiple shaders for different purposes and then minimize state changes by grouping the objects for each rendering pass (a lot of apps and games do this). This will likely give better performance on any platform than an 'über-shader' approach.
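
A minimal sketch of that approach in C; build_program() and the shader sources are hypothetical stand-ins for the usual compile/link boilerplate:

#include <GL/gl.h>

enum effect { EFFECT_A, EFFECT_B, NUM_EFFECTS };

/* Hypothetical helper wrapping glCreateShader/glCompileShader/
   glAttachShader/glLinkProgram. */
extern GLuint build_program(const char *vs_src, const char *fs_src);
extern const char *vs_src;               /* shared vertex shader */
extern const char *fs_src[NUM_EFFECTS];  /* one small shader per effect */

static GLuint programs[NUM_EFFECTS];

static void init_programs(void)
{
    for (int i = 0; i < NUM_EFFECTS; i++)
        programs[i] = build_program(vs_src, fs_src[i]);
}

static void draw_frame(void)
{
    /* Group objects by effect: one glUseProgram per group instead of a
       per-fragment switch inside a single über-shader. */
    for (int e = 0; e < NUM_EFFECTS; e++) {
        glUseProgram(programs[e]);
        /* ... draw every object that uses effect e ... */
    }
}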

As for the 'apples and oranges' comparison: on Nvidia you will need to measure without the shader cache. According to the documentation below you should either remove '$HOME/.nv/GLCache' or export __GL_SHADER_DISK_CACHE=false before measuring (I don't know whether these still work because the docs seem a bit old):

http://us.download.nvidia.com/XFree86/Linux-x86/319.32/README/openglenvvariables.html

Then the comparison should be more fair.
Comment 9 Clive McCarthy 2015-09-04 14:34:15 UTC
(In reply to Tapani Pälli from comment #8)
> Thanks for attaching the shader, will do some profiling on this later.
> 
> In case this is a 'real life' case that is causing you problems: while it
> is all 'legal', IMO it would make more sense to write multiple shaders for
> different purposes and then minimize state changes by grouping the objects
> for each rendering pass (a lot of apps and games do this). This will likely
> give better performance on any platform than an 'über-shader' approach.
> 
> As for the 'apples and oranges' comparison: on Nvidia you will need to
> measure without the shader cache. According to the documentation below you
> should either remove '$HOME/.nv/GLCache' or export
> __GL_SHADER_DISK_CACHE=false before measuring (I don't know whether these
> still work because the docs seem a bit old):
> 
> http://us.download.nvidia.com/XFree86/Linux-x86/319.32/README/openglenvvariables.html
> 
> Then the comparison should be more fair.

The shader itself is fast enough and works just fine (weird as it might be). The speed issue is with the Mesa linker, running x86 code. My über-shader allows me to try various approaches and experiment very quickly.

The timing of the linker was done with gettimeofday() and represents how long the linker took before any rendering even began. This is not a rendering problem.
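
A minimal sketch of such a measurement (assuming a current GL context; the GL_LINK_STATUS query ensures the link has completed before the second timestamp):

#include <sys/time.h>
#include <GL/gl.h>

/* Wall-clock time around glLinkProgram() alone, before any rendering. */
static double link_time_ms(GLuint program)
{
    struct timeval t0, t1;
    GLint status;

    gettimeofday(&t0, NULL);
    glLinkProgram(program);
    glGetProgramiv(program, GL_LINK_STATUS, &status); /* force completion */
    gettimeofday(&t1, NULL);

    return (t1.tv_sec  - t0.tv_sec)  * 1000.0 +
           (t1.tv_usec - t0.tv_usec) / 1000.0;
}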
Comment 10 Clive McCarthy 2015-09-04 17:12:45 UTC
Just for grins I did this bananas and pears comparison: I ran my code on an Atom 330 1.6GHz with an Nvidia ION GPU. This was once my main target platform before the Intel NUCs arrived and effectively killed off Nvidia's low-end graphics.

	OpenGL version:		2.1.2 NVIDIA 331.38
	OpenGL renderer:	ION/integrated/SSE2
	GLSL version:		1.20 NVIDIA via Cg compiler
	GLSL vertex         =   461.687 ms
	GLSL fragment       =   105.462 ms
	GLSL linking        = 14350.100 ms

As you can see, it took the Nvidia linker 14 seconds to link my shader, but an Atom 330 at 1.6GHz isn't much of a processor. An i5-5250U at 1.6GHz takes 59 seconds with Mesa 10.6.3 to link the same shader.
Comment 11 Clive McCarthy 2015-09-04 18:19:37 UTC
I had forgotten to update the Nvidia driver on the Atom/Ion box, which now shows:

	OpenGL version:		3.3.0 NVIDIA 340.76
	OpenGL renderer:	ION/integrated/SSE2
	GLSL version:		3.30 NVIDIA via Cg compiler
	GLSL vertex         =   43.074 ms
	GLSL fragment       =    3.454 ms
	GLSL linking        =  120.391 ms

Looks as if Nvidia did some stuff to speed up their linker.
Comment 12 Jordan Justen 2015-09-06 23:23:28 UTC
Created attachment 118110 [details]
piglit shader_runner shader test

I put Clive's shader code into the attached shader_runner test.

On a gen8 i7:

Mesa 10.6.4 release build: 44 seconds

Mesa ~master debug build: 267 seconds
Comment 13 Clive McCarthy 2015-09-07 01:03:55 UTC
Thank you Jordan, reproducibility is very important. I bet some folks were thinking I was crazy. It's possible to boil an egg in 267 seconds.
Comment 14 Tapani Pälli 2015-09-07 06:53:08 UTC
By bisecting and jumping to some old commits, I believe there are multiple small changes (from new features) contributing to the total. Several things take the run from 30 to ~40 secs, but the biggest change is caused by commit 0ecdf04, which doubles the total time from ~40 secs to ~80 secs.

Some times from my system:

0ecdf04 (83 secs)
1370fde (42 secs)
22ee7ee (30 secs)

The commit causing the biggest jump is:
--- 8< ---
commit 0ecdf04060518149e99a098caf4f6025fd6482a4
Author: Connor Abbott <cwabbott0@gmail.com>
Date:   Thu Jun 25 16:22:26 2015 -0700

    i965/fs: emit constants only once
Comment 15 Clive McCarthy 2015-09-07 18:12:26 UTC
Clearly, whatever my *weird* fragment shader is doing to the linker shouldn't impede the excellent optimizations that are being made to the actual shader executable.

My own work-around on an i3 is to show a fixed image that allows a projector to be focused for the ~90 seconds while the shader links.

Factors of two in the linking of the shader are not the issue; there is a two-or-three-orders-of-magnitude class of problem somewhere. An i7 far outclasses the Atom processor, yet the Nvidia linker gets the job done in 120 ms while the i7 takes 30 to 40 seconds.

Profiling the linker while it's working on the shader should show where all the cycles are being consumed. There are an awful lot of i7 cycles in 30 seconds.

I built Ian Romanick's standalone Mesa compiler and use it as an adjunct to my system just to make sure my shaders are compliant and portable. I'm tempted to try and build the Mesa linker too but I don't know that I have the skills to do the job.
Comment 16 Jordan Justen 2015-09-08 06:14:13 UTC
(In reply to Clive McCarthy from comment #15)
> Factors of two in the linking of the shader are not the issue; there is a
> two-or-three-orders-of-magnitude class of problem somewhere. An i7 far
> outclasses the Atom processor, yet the Nvidia linker gets the job done in
> 120 ms while the i7 takes 30 to 40 seconds.

Are you exporting __GL_SHADER_DISK_CACHE=false?

I've seen the time on an nvidia system go from
200ms to 2.1s after setting that variable.

Just trying to separate out the effects of the
nvidia shader cache...
Comment 17 Clive McCarthy 2015-09-08 14:30:33 UTC
(In reply to Jordan Justen from comment #16)
> (In reply to Clive McCarthy from comment #15)
> > Factors of two in the linking of the shader are not the issue; there is a
> > two-or-three-orders-of-magnitude class of problem somewhere. An i7 far
> > outclasses the Atom processor, yet the Nvidia linker gets the job done in
> > 120 ms while the i7 takes 30 to 40 seconds.
> 
> Are you exporting __GL_SHADER_DISK_CACHE=false?
> 
> I've seen the time on an nvidia system go from
> 200ms to 2.1s after setting that variable.
> 
> Just trying to separate out the effects of the
> nvidia shader cache...

I'll set that environment variable and report back.
Comment 18 Clive McCarthy 2015-09-08 14:45:48 UTC
I ran the application as root, which disables the shader cache. The shader linked in 9.8 seconds on an Atom 330 at 1.6GHz, so I take it back: the problem isn't orders of magnitude, but it is something more than a factor of 4.

Sorry for not understanding Nvidia's cache strategy.
Comment 19 Clive McCarthy 2015-09-08 15:31:27 UTC
I re-ran my apples and oranges comparison from my first posting: an Nvidia driver on an i7-4790 platform, with the same GLSL code (but run as root to disable the cache), links in 600ms. So we maybe have a factor of ~70.
Comment 20 Clive McCarthy 2015-09-08 15:45:24 UTC
Do you guys support glProgramBinary? If so, this whole issue goes away.
Comment 21 Clive McCarthy 2015-09-08 17:46:58 UTC
I see that Mesa 10.6.3 does support glProgramBinary -- I shall start working with it today !!
Comment 22 Ian Romanick 2015-09-08 20:58:36 UTC
(In reply to Clive McCarthy from comment #21)
> I see that Mesa 10.6.3 does support glProgramBinary -- I shall start working
> with it today !!

But you won't... because we support 0 binary formats. :(
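
A sketch of the check involved (assuming GL 4.1 / ARB_get_program_binary prototypes are available; error handling omitted):

#include <stdlib.h>
#include <GL/gl.h>
#include <GL/glext.h>

/* Returns a malloc'd binary blob, or NULL when the driver exposes the
   extension but advertises zero binary formats (the Mesa case here). */
static void *get_program_binary(GLuint prog, GLsizei *len, GLenum *fmt)
{
    GLint num_formats = 0, size = 0;

    glGetIntegerv(GL_NUM_PROGRAM_BINARY_FORMATS, &num_formats);
    if (num_formats == 0)
        return NULL;            /* nothing to retrieve: 0 formats */

    glGetProgramiv(prog, GL_PROGRAM_BINARY_LENGTH, &size);
    void *binary = malloc(size);
    glGetProgramBinary(prog, size, len, fmt, binary);
    return binary;
}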
Comment 23 Clive McCarthy 2015-09-08 21:47:47 UTC
Ian, Thanks for telling me ! 

I have been banging my head against the screen for an hour, puzzling over why it all seems good on my workstation but then, after 59 seconds of linking on an i5-5250U, when I try to get the binary it returns zero bytes. Ahhhhhhhhhhhhhhhh!

Is this one of those cunning tricks where someone offers to sell New York steaks for $5/lb except they are always out of stock? The call with GL_ARB_get_program_binary succeeds but now you tell me there are no binary formats?

Isn't that somewhat of a wicked tease?
Clive.
Comment 24 Clive McCarthy 2015-09-08 21:55:53 UTC
"You mean there's a catch?"
"Sure there's a catch", Doc Daneeka replied. "Catch-22. Anyone who wants to get out of combat duty isn't really crazy."
There was only one catch and that was Catch-22, which specified that a concern for one's own safety in the face of dangers that were real and immediate was the process of a rational mind. Orr was crazy and could be grounded. All he had to do was ask; and as soon as he did, he would no longer be crazy and would have to fly more missions. Orr would be crazy to fly more missions and sane if he didn't, but if he was sane, he had to fly them. If he flew them, he was crazy and didn't have to; but if he didn't want to, he was sane and had to. Yossarian was moved very deeply by the absolute simplicity of this clause of Catch-22 and let out a respectful whistle.
Comment 25 Kenneth Graunke 2015-09-13 07:08:43 UTC
(In reply to Jordan Justen from comment #12) 
> Mesa 10.6.4 release build: 44 seconds
> Mesa ~master debug build: 267 seconds

Jordan, FWIW, linking in debug builds will be obscenely slow because we call nir_validate_shader() after every optimization/lowering pass.  In release builds, that is a no-op...so measuring debug builds isn't terribly meaningful.

I landed a few fixes in master that should help out a little...the front-end time is a lot more reasonable now.  FS register allocation is still horridly slow, however.  This is because of our stupid spilling loop.  It takes 132 iterations through the RA loop before it succeeds.

We can totally fix that, we just haven't made it a priority yet...most shaders don't trip up this pathological case.
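
Schematically, and emphatically not Mesa's actual code (all names below are illustrative stand-ins), the loop described above has this shape; the cost comes from redoing all the analysis on every iteration:

#include <stdbool.h>

typedef struct shader shader;                  /* illustrative stand-ins */
typedef struct graph  graph;
void   compute_liveness(shader *s);
graph *build_interference_graph(shader *s);
bool   graph_color(graph *g);                  /* try to assign registers */
int    choose_spill_node(graph *g);
void   spill(shader *s, int node);             /* insert scratch loads/stores */

static bool allocate_registers(shader *s)
{
    for (;;) {                                  /* 132 iterations here */
        compute_liveness(s);                    /* redone from scratch */
        graph *g = build_interference_graph(s); /* O(n^2) pairwise tests */
        if (graph_color(g))
            return true;
        int node = choose_spill_node(g);
        if (node < 0)
            return false;                       /* nothing left to spill */
        spill(s, node);                         /* grows the instruction count */
    }
}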
Comment 26 Clive McCarthy 2015-09-13 15:58:36 UTC
(In reply to Ian Romanick from comment #22)
> (In reply to Clive McCarthy from comment #21)
> > I see that Mesa 10.6.3 does support glProgramBinary -- I shall start working
> > with it today !!
> 
> But you won't... because we support 0 binary formats. :(

My *weird* shader is not intended to be pathological; further, it seems a poor use of developer time to focus on optimizations when basic functionality is needed.

The effort that might go into dealing with this bug would be better spent getting i965 to handle glProgramBinary in a _useful_ way. That would not only provide a reasonable workaround for the slow linking but would also add a useful feature.
Comment 27 Ilia Mirkin 2016-01-22 19:12:44 UTC
Happened upon this bug while browsing, briefly looked into it. Running that shader_test on my i7-4790:

nouveau/nvc0: 17840 instructions, 0.88 seconds
i965/hsw: 16567 instructions, 72 seconds

Arguably the i965 and nouveau backends are at comparable levels of optimization sophistication, so it seems like i965 could benefit from some improvements or heuristics.
Comment 28 Kenneth Graunke 2016-01-22 21:30:07 UTC
(In reply to Ilia Mirkin from comment #27)
> Happened upon this bug while browsing, briefly looked into it. Running that
> shader_test on my i7-4790:
> 
> nouveau/nvc0: 17840 instructions, 0.88 seconds
> i965/hsw: 16567 instructions, 72 seconds

Yeah, so that pretty much proves that it's the i965 register allocator not handling spilling decently.
Comment 29 Ilia Mirkin 2016-01-22 21:59:45 UTC
(In reply to Kenneth Graunke from comment #28)
> (In reply to Ilia Mirkin from comment #27)
> > Happened upon this bug while browsing, briefly looked into it. Running that
> > shader_test on my i7-4790:
> > 
> > nouveau/nvc0: 17840 instructions, 0.88 seconds
> > i965/hsw: 16567 instructions, 72 seconds
> 
> Yeah, so that pretty much proves that it's the i965 register allocator not
> handling spilling decently.

Well, running with INTEL_DEBUG=fs, there were definitely several spurts of output ~seconds apart. So the register allocator might account for 65 of those seconds, but I don't think it's the whole thing.
Comment 30 Eero Tamminen 2016-08-29 12:57:12 UTC
"perf" profiling info for today's Mesa version:
-------------------------------------------------------------------
$ sudo apt-get install linux-tools-common
$ perf record -- frag_runner frag_shader=obsfucated_shader.frag ...
$ perf report -n --sort comm,dso
Overhead       Samples  Command      Shared Object
  95.18%        218725  frag_runner  i965_dri.so
   4.59%         10964  frag_runner  libc-2.23.so
...
$ perf report
#       Object        Symbol                              
30.99%  i965_dri.so   fs_visitor::virtual_grf_interferes                       
28.39%  i965_dri.so   ra_allocate                                              
15.80%  i965_dri.so   fs_visitor::assign_regs                                  
 4.47%  i965_dri.so   ra_add_node_adjacency                                    
 1.64%  i965_dri.so   (anonymous namespace)::ir_copy_propagation_visitor::visit
 1.54%  i965_dri.so   decrement_q.isra.2                                       
 1.37%  i965_dri.so   (anonymous namespace)::ir_copy_propagation_visitor::kill 
 1.24%  i965_dri.so   ra_add_node_interference                                 
 1.22%  i965_dri.so   brw::fs_live_variables::compute_start_end                
 1.11%  libc-2.23.so  _int_malloc                                              
 0.94%  i965_dri.so   (anonymous namespace)::ir_copy_propagation_elements_visitor::kill
 0.75%  libc-2.23.so  _int_free                                                
 0.69%  libc-2.23.so  __libc_calloc                                            
 0.63%  i965_dri.so   (anonymous namespace)::ir_copy_propagation_elements_visitor::handle_rvalue
 0.60%  libc-2.23.so  realloc                                                  
 0.45%  i965_dri.so   brw::fs_live_variables::setup_def_use                    
 0.42%  libc-2.23.so  _int_realloc                                             
 0.40%  i965_dri.so   fs_visitor::choose_spill_reg                             
 0.36%  i965_dri.so   ra_get_best_spill_node                                   
 0.35%  i965_dri.so   unsafe_free                                              
 0.34%  i965_dri.so   ir_expression::accept                                    
 0.32%  libc-2.23.so  __memset_sse2                                            
 0.27%  libc-2.23.so  malloc_consolidate                                       
 0.26%  i965_dri.so   match_value                                              
 0.25%  libc-2.23.so  __memcpy_sse2                                            
 0.25%  i965_dri.so   fs_visitor::calculate_payload_ranges                     
 0.25%  i965_dri.so   get_used_mrfs                                            
 0.24%  i965_dri.so   backend_reg::in_range                                    
 0.24%  i965_dri.so   brw::fs_live_variables::compute_live_variables           
 0.22%  i965_dri.so   visit_list_elements                                      
 0.20%  i965_dri.so   fs_visitor::spill_reg                                    
 0.13%  libc-2.23.so  __memset_avx2                                            
 0.12%  i965_dri.so   reralloc_array_size
-------------------------------------------------------------------

When using perf report interactively, one can see that:
* virtual_grf_interferes() cost goes to the couple of if checks in that function, so it just gets called too much.  The calling is done by assign_regs():
  -----------------
   for (unsigned i = 0; i < this->alloc.count; i++) {
      ...
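      /* O(count^2) pairwise interference tests; per comment 25 the RA
         loop re-runs all of this on every spill iteration */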
      for (unsigned j = 0; j < i; j++) {
         if (virtual_grf_interferes(i, j)) {
            ra_add_node_interference(g, i, j);
         }
      }
  -----------------
* All the assign_regs() cost goes to running the inner loop above.
* ra_allocate() cost goes mainly to the inlined graph node loop from ra_simplify(), and slightly to the inlined check from pq_test().
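
Some illustrative arithmetic (the virtual GRF count is an assumption; a ~17k-instruction shader plausibly has virtual GRFs numbering in the thousands):

#include <stdio.h>

int main(void)
{
    /* Assumed magnitudes: n virtual GRFs, and the 132 RA/spill
       iterations mentioned in comment 25. */
    unsigned long long n = 10000, iterations = 132;
    unsigned long long tests_per_pass = n * (n - 1) / 2;

    printf("%llu interference tests per pass\n", tests_per_pass);
    printf("%llu tests across all spill iterations\n",
           tests_per_pass * iterations);
    /* ~5.0e7 per pass, ~6.6e9 in total -- consistent with the tens of
       seconds spent in virtual_grf_interferes()/assign_regs() above. */
    return 0;
}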
Comment 31 Clive McCarthy 2017-09-15 04:00:16 UTC
It seems that an important issue was identified a while back:

commit 0ecdf04060518149e99a098caf4f6025fd6482a4
Author: Connor Abbott <cwabbott0@gmail.com>
Date:   Thu Jun 25 16:22:26 2015 -0700

It is now 2-1/2 years later and the issue isn't yet fixed. On current lower-end Intel processors some 2/3 of the die area is dedicated to the graphics processor. Thus Intel isn't in the CPU business but the GPU business with a little multi-core x86 CPU tacked on the side. Even the i7-7700HQ I'm using to write this is like that.

I once wore an Intel badge. Get your act together guys. Andy Grove's ghost is looking at you.
Comment 32 Eero Tamminen 2017-09-15 08:46:28 UTC
There have been several fixes in this area since this bug was filed (e.g. early this year), both to reduce spilling and to reduce its impact, so the exponential cost of spilling is now an issue only for rare spill-like-crazy shaders.

Francisco has WIP patches that should fix most of the remaining issues.


But for now, you can just use Jordan's shader cache branch:
https://cgit.freedesktop.org/~jljusten/mesa/log/?h=i965-shader-cache

(The reason it hasn't gone upstream in the last half a year is that there's a bug with one game that hasn't been tracked down yet.)
Comment 33 Clive McCarthy 2017-11-04 21:49:49 UTC
I have just (Nov 4th 2017) tested this issue with Mesa 17.4.0-devel.
The linkage now takes 46 seconds on an i3 (NUC5i3RYK).

It seems to me that this issue would best be resolved by creating a binary format.
Comment 34 Tapani Pälli 2017-11-06 08:06:33 UTC
(In reply to Clive McCarthy from comment #33)
> I have just (Nov 4th 2017) tested this issue with Mesa 17.4.0-devel.
> The linkage now takes 46 seconds on an i3 (NUC5i3RYK).
> 
> It seems to me that this issue would best be resolved by creating a binary
> format.

We now have "automatic shader cache" support in Mesa master. Please retry using the Mesa master tree with the following environment variable:

export MESA_GLSL_CACHE_DISABLE=0

On the 2nd run you should get a much faster result. The shader cache will be enabled by default in the near future, once final testing and validation are done.
Comment 35 Tapani Pälli 2019-05-27 08:19:53 UTC
This shader now compiles and links in ~13 sec on my HSW with current Mesa master (at 2d2e7954c36), and in less than a second after that as the shader cache gets utilized. Clive, please test with the latest Mesa release and report whether you agree this issue is resolved.
Comment 36 Clive McCarthy 2019-05-27 14:27:22 UTC
I'll get the latest version from Oibaf and report back.
Comment 37 Clive McCarthy 2019-05-28 07:44:11 UTC
I ran my code tonight and got:

	OpenGL version:		3.0 Mesa 19.2.0-devel (git-af0de6b 2019-05-27 bionic-oibaf-ppa)
	OpenGL vendor:		Intel Open Source Technology Center
	OpenGL renderer:	Mesa DRI Intel(R) HD Graphics (Coffeelake 3x8 GT3) 
        GLSL vertex     :       25,099 us
        GLSL fragment   :      141,915 us
        GLSL linking    :    2,764,600 us
	GLSL shader loaded

The vertex shader is a little more complex than the trivial one before but the fragment shader is the same.

The CPU is the faster i7-8559U; however, congratulations on killing this bug.
Comment 38 Tapani Pälli 2019-05-28 08:07:51 UTC
Note: I measured the time simply by using the 'time' command with Piglit's shader_runner and the attached shader_test file, so my measurement contains a bit more than just compile+link.

Based on comment #37 I'm resolving this bug as fixed. Thanks for the report and testcase!

