Bug 110412 - Over 15% performance lost on large branching shader
Summary: Over 15% performance lost on large branching shader
Status: RESOLVED DUPLICATE of bug 109517
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
Reported: 2019-04-12 11:23 UTC by Kevin Rogovin
Modified: 2019-04-17 14:36 UTC

Attachments
output from Mesa 18.2 (825.82 KB, text/plain)
2019-04-12 11:23 UTC, Kevin Rogovin
output from Mesa Git (824.10 KB, text/plain)
2019-04-12 11:23 UTC, Kevin Rogovin

Description Kevin Rogovin 2019-04-12 11:23:10 UTC
Created attachment 143948 [details]
output from Mesa 18.2

A large branching shader loses over 15% performance on Mesa git (revision a182adfd83ad00e326153b00a725a014e0359bf0) compared with Mesa 18.2.8 (on Ubuntu 18.04).

To replicate,

 1. Build painter-glyph-test-GL-debug from the project https://github.com/intel/fastuidraw
 2. Run (adjusting the width and height options to match one's monitor) with
    
     vblank_mode=0 LD_LIBRARY_PATH=. ./painter-glyph-test-GL-release fullscreen true width 1920 height 1200 use_file true text demo_data/txt/wall_of_text_caps_no_numbers.txt

On my Iris Pro Graphics 580 (Skylake GT4e), I see (with fluctuations):

   Mesa 18.2.8: 5.6 ms/frame [178 FPS]
   a182adfd83ad00e326153b00a725a014e0359bf0: 6.5 ms/frame [153 FPS]
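
For reference, the frame times above can be converted to frame rates and a relative slowdown as in the minimal sketch below. The helper names are illustrative only and are not part of fastuidraw or Mesa; the two input values are the ones quoted in this report.

```python
def fps(frame_ms: float) -> float:
    """Frames per second for a given frame time in milliseconds."""
    return 1000.0 / frame_ms


def frame_time_increase(old_ms: float, new_ms: float) -> float:
    """Relative increase in frame time, as a percentage of the old time."""
    return (new_ms - old_ms) / old_ms * 100.0


# Frame times quoted in the report: Mesa 18.2.8 vs. Mesa git a182adfd.
old_ms, new_ms = 5.6, 6.5

print(f"{fps(old_ms):.1f} FPS -> {fps(new_ms):.1f} FPS")
print(f"frame time increase: {frame_time_increase(old_ms, new_ms):.1f}%")
```

Measured as frame time, the slowdown comes out at roughly 16%, which matches the "over 15%" in the summary; measured as lost FPS, the same numbers give a somewhat smaller percentage.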

The shader being executed is a large uber-shader. In both Mesa versions tested above, the uber-shader is compiled only as SIMD8, with no spilling.

Attached are the outputs when running with MESA_GLSL_CACHE_DISABLE=true INTEL_DEBUG=fs for the offending fragment shader.
Comment 1 Kevin Rogovin 2019-04-12 11:23:41 UTC
Created attachment 143949 [details]
output from Mesa Git
Comment 2 Paul 2019-04-12 15:29:38 UTC
Hi Kevin
I've compiled the test and run it, but I'm not sure how to compare FPS.
How did you measure it? Did you use a special tool, a flag in the test, or something else?
Comment 3 Kevin Rogovin 2019-04-12 15:32:19 UTC
Press the "L" key (at least on US keyboards) to bring up a display with FPS and other statistics. At startup, a list of what every key press does is printed to stdout. If you are sufficiently masochistic, you can run the program with the single command-line argument "--help" to see all command-line options.

Just to make sure all is good, did the demo as-is draw a wall of text to the screen?

-Kevin
Comment 4 Eero Tamminen 2019-04-12 15:39:52 UTC
Looking at the attached shader assembly...

Mesa 18.2:
SIMD8 shader: 2413 instructions. 11 loops. 131452 cycles. 0:0 spills:fills. Promoted 15 constants. Compacted 38608 to 27856 bytes (28%)

Mesa git:
SIMD8 shader: 2388 instructions. 11 loops. 120307 cycles. 0:0 spills:fills. Promoted 14 constants. Compacted 38208 to 27392 bytes (28%)

=> Both versions reach only SIMD8, and the new version uses fewer instructions.
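
The comparison above can be reproduced mechanically from the two summary lines. The following is only a sketch: the regular expression is written against the two lines quoted in this comment and is an assumption about their layout, not a parser taken from Mesa.

```python
import re

# Pattern for the shader summary line as quoted above (assumed layout).
STATS_RE = re.compile(
    r"SIMD(?P<width>\d+) shader: (?P<instructions>\d+) instructions\. "
    r"(?P<loops>\d+) loops\. (?P<cycles>\d+) cycles\. "
    r"(?P<spills>\d+):(?P<fills>\d+) spills:fills\."
)


def parse_stats(line: str) -> dict:
    """Extract the integer fields from one shader summary line."""
    match = STATS_RE.search(line)
    if match is None:
        raise ValueError(f"not a shader summary line: {line!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


old = parse_stats("SIMD8 shader: 2413 instructions. 11 loops. 131452 cycles. "
                  "0:0 spills:fills. Promoted 15 constants. "
                  "Compacted 38608 to 27856 bytes (28%)")
new = parse_stats("SIMD8 shader: 2388 instructions. 11 loops. 120307 cycles. "
                  "0:0 spills:fills. Promoted 14 constants. "
                  "Compacted 38208 to 27392 bytes (28%)")

# Print the change in the two fields discussed in this comment.
for field in ("instructions", "cycles"):
    delta = new[field] - old[field]
    print(f"{field}: {old[field]} -> {new[field]} ({delta:+d})")
```

Note that the cycle count is the compiler's static estimate, so a lower number here does not by itself say anything about measured frame time.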

Loops in the git version are shorter, except for the last two, which are marginally longer:

Mesa 18.2:
while(8)        JIP: -216                                       { align1 1Q };
while(8)        JIP: -216                                       { align1 1Q };
while(8)        JIP: -216                                       { align1 1Q };
while(8)        JIP: -216                                       { align1 1Q };
while(8)        JIP: -216                                       { align1 1Q };
while(8)        JIP: -216                                       { align1 1Q };
while(8)        JIP: -216                                       { align1 1Q };
while(8)        JIP: -296                                       { align1 1Q };
while(8)        JIP: -4496                                      { align1 1Q };
while(8)        JIP: -1136                                      { align1 1Q };
while(8)        JIP: -1136                                      { align1 1Q };

Mesa git:
while(8)        JIP: -200                                       { align1 1Q };
while(8)        JIP: -200                                       { align1 1Q };
while(8)        JIP: -200                                       { align1 1Q };
while(8)        JIP: -200                                       { align1 1Q };
while(8)        JIP: -200                                       { align1 1Q };
while(8)        JIP: -200                                       { align1 1Q };
while(8)        JIP: -200                                       { align1 1Q };
while(8)        JIP: -288                                       { align1 1Q };
while(8)        JIP: -4424                                      { align1 1Q };
while(8)        JIP: -1144                                      { align1 1Q };
while(8)        JIP: -1144                                      { align1 1Q };
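
The per-loop differences in the two listings can be tabulated with a short script. This is a sketch under the assumption that each loop ends in a `while` line formatted as shown above; the JIP values are compared as printed, without interpreting their unit.

```python
import re

# Matches the back-edge of a loop as printed in the listings above.
JIP_RE = re.compile(r"while\(\d+\)\s+JIP:\s*(-?\d+)")


def loop_jumps(asm: str) -> list[int]:
    """Return the JIP value of every `while` instruction, in order."""
    return [int(m.group(1)) for m in JIP_RE.finditer(asm)]


def listing(jips):
    """Rebuild `while` lines in the format quoted in this comment."""
    return "\n".join(f"while(8)        JIP: {j}    {{ align1 1Q }};" for j in jips)


# JIP values copied from the two listings in this comment.
OLD_ASM = listing([-216] * 7 + [-296, -4496, -1136, -1136])
NEW_ASM = listing([-200] * 7 + [-288, -4424, -1144, -1144])

old = loop_jumps(OLD_ASM)
new = loop_jumps(NEW_ASM)

# A negative change means the backward jump, and so the loop body, got shorter.
for i, (o, n) in enumerate(zip(old, new), start=1):
    change = abs(n) - abs(o)
    print(f"loop {i:2d}: {o:6d} -> {n:6d} ({change:+d})")
```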

At most, the old code seems to have 61 live registers and the new code 62.

Both Mesa and my own (crappy) ISA analyzer think that the new version (which has more lrp & mad register bank conflicts) should use fewer cycles, but in branching code that can't really be predicted, as it depends so much on which branches get taken.
Comment 5 Kevin Rogovin 2019-04-12 15:42:05 UTC
For this test, the branches that get hit are all the same.

Did you get the demo to run to verify the performance drop?
Comment 6 Kevin Rogovin 2019-04-12 15:45:01 UTC
If you add "painter_use_uber_item_shader false" to the command line, that should make the shader much less uber-ish for analysis (though I confess I have not compared the benchmark numbers for this case yet).
Comment 7 Kevin Rogovin 2019-04-12 16:08:12 UTC
Hi,

Apparently I added a show_framerate option that prints the average frame time across all frames to stdout. To use it, add "show_framerate true" to the command line. If you pull the latest source (git commit 203b84c336c0c013cae670766182c5ea81cd0711 or newer), there is a "warm-up counter" that excludes the first N frames from the average.

-Kevin
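
The idea of a warm-up counter can be illustrated as below. This is a hypothetical sketch of the averaging technique, not fastuidraw's actual implementation; the function name, the parameter name, and the sample frame times are all made up for illustration.

```python
def average_frame_time(frame_times_ms, warmup_frames=0):
    """Average frame time, ignoring the first `warmup_frames` samples."""
    steady = frame_times_ms[warmup_frames:]
    if not steady:
        raise ValueError("no frames left after warm-up")
    return sum(steady) / len(steady)


# Made-up samples: the first two frames are slow (start-up work),
# the rest are steady-state frames.
samples = [40.0, 12.0, 6.0, 6.0, 6.0, 6.0]

print(average_frame_time(samples))                   # start-up frames included
print(average_frame_time(samples, warmup_frames=2))  # steady state only
```

Skipping the first frames matters for comparisons like the one in this bug, because one-off start-up costs would otherwise inflate the average on both Mesa versions and shrink the visible difference.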
Comment 8 Paul 2019-04-15 08:51:35 UTC
Hi guys
Kevin, thanks for the tip - it works.
I've bisected Mesa between mesa-18.2.8 (785e09e3b3) and the latest master (04e672257c) on Skylake with Intel® HD Graphics 520.
The bisect led me to:
commit a920979d4f30a48a23f8ff375ce05fa8a947dd96
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Fri Nov 16 10:46:27 2018 -0600

	intel/fs: Use split sends for surface writes on gen9+
    
	Surface reads don't need them because they just have the one address
	payload.  With surface writes, on the other hand, we can put the address
	and the data in the different halves and avoid building the payload all
	together.
    
	The decrease in register pressure and added freedom in register
	allocation resulting from this change reduces spilling enough to improve
	the performance of one customer benchmark by about 2x.
    
	Reviewed-by: Iago Toral Quiroga <itoral@igalia.com>

Bad commits had 60 FPS, good commits had 70 FPS on my machine.
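
The search that a bisect performs can be sketched as a binary search over an ordered list of commits. Everything in this example is hypothetical: the commit labels, the per-commit FPS values, and the 65 FPS threshold are invented to mirror the "about 70 FPS good, about 60 FPS bad" observation above, and this is not how `git bisect` is implemented internally.

```python
def first_bad(commits, is_bad):
    """Return the first commit for which is_bad() is true.

    Assumes commits[0] is good, commits[-1] is bad, and that every
    commit after the first bad one is also bad.
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Hypothetical history, oldest first, with made-up FPS measurements.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
measured_fps = {"c0": 70, "c1": 70, "c2": 71, "c3": 70, "c4": 69,
                "c5": 60, "c6": 60, "c7": 61}

# Treat anything below an assumed 65 FPS threshold as "bad".
culprit = first_bad(history, lambda c: measured_fps[c] < 65)
print(culprit)
```

Each step halves the remaining range, so even the several thousand commits between a release and master need only a dozen or so benchmark runs.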
Comment 9 Kevin Rogovin 2019-04-15 10:39:55 UTC
Thank you for the work of finding the offending commit!

I confess, though, that this leaves even more mysteries: the commit message states the change is only for surface write messages, and the shaders in the benchmark should have surface writes only at the very end, when writing to the render target (dual-src).

Hopefully, someone from the Intel Mesa team will pick this up and investigate.
Comment 10 Paul 2019-04-16 10:50:48 UTC
Hi guys
It looks like this is a duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=110344
Jason has described the full scope of the work there.
I'm adding that ticket to the 'see also' section.
Comment 11 Jason Ekstrand 2019-04-16 14:35:27 UTC

*** This bug has been marked as a duplicate of bug 110344 ***
Comment 12 Jason Ekstrand 2019-04-17 14:36:28 UTC

*** This bug has been marked as a duplicate of bug 109507 ***
Comment 13 Jason Ekstrand 2019-04-17 14:36:59 UTC

*** This bug has been marked as a duplicate of bug 109517 ***

