99221 – 2-3% perf drop in PSPom with "i965: disable loop unrolling in GLSL IR"

Bug 99221 - 2-3% perf drop in PSPom with "i965: disable loop unrolling in GLSL IR"

Summary: 2-3% perf drop in PSPom with "i965: disable loop unrolling in GLSL IR"

Status:	VERIFIED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965 (show other bugs)
Version:	git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2016-12-29 14:57 UTC by Eero Tamminen
Modified:	2017-05-03 13:36 UTC (History)
CC List:	1 user (show)

See Also:	100913
i915 platform:
i915 features:

Attachments
TREX loop unrolled in GLSL IR (17.94 KB, text/plain) 2016-12-30 09:27 UTC, Timothy Arceri	Details
TREX loop unrolled in NIR (17.29 KB, text/plain) 2016-12-30 09:28 UTC, Timothy Arceri	Details
Fix loop interation count for floats (952 bytes, patch) 2017-01-03 01:32 UTC, Timothy Arceri	Details \| Splinter Review
View All

Description Eero Tamminen 2016-12-29 14:57:41 UTC

Patch series culminating in this commit changed performance in several tests:
-------------------------------------------------------
commit 40e9f2f13847ddd94e1216088aa00456d7b02d2b
Author:     Timothy Arceri <timothy.arceri@collabora.com>
AuthorDate: Tue Dec 13 11:37:25 2016 +1100
Commit:     Timothy Arceri <timothy.arceri@collabora.com>
CommitDate: Fri Dec 23 10:15:36 2016 +1100

    i965: disable loop unrolling in GLSL IR
    
    There is a single regression in loop unrolling which is:
    
    loops HURT:   shaders/orbital_explorer.shader_test GS SIMD8:    0 -> 1
    
    However the loop is huge so it seems reasonable not to unroll it. It's
    surprising that GLSL IR does unroll it.
-------------------------------------------------------

On SKL i5-6600K (GT2), the changes were following (in FullHD size):

Performance dropped due to "disable loop unrolling in GLSL IR":
- 2.7% SynMark PSPom
- 2.3% SynMark PSPhong
- 2.2% GfxBench T-Rex (GL version)
- 0.5% SynMark PSBump8

Performance increased due to "use nir loop unrolling pass":
+ 12.5% SynMark ShMapPcf
+ 3.9% SynMark CSDof (+8.4% from "use nir_lower_indirect_derefs() for GLSL")
+ 2.2% SynMark DevRes (composite test including other affected tests)
+ 0.5% Unigine Valley
+ 0.5% SynMark PSBump2

Results are similar on other platforms, except for CSDof where results depend a lot on the HW.  GfxBench is v4.0, SynMark v7.0.

CSDof performance increased a lot on SKL GT2, KBL GT2 and a bit BYT, but its perf dropped a lot SKL GT3e & GT4e, BDW GT3 & GT2, BSW, and a bit on HSW GT3e and BXT.

After these 2 changes, SynMark shader compilation speed test is ~25% faster on all platforms which is pretty good.

Comment 1 Timothy Arceri 2016-12-29 21:55:15 UTC

Unfortunately I don't have access to these benchmarks so its a little hard to know what caused the regressions.

It's possible Ken's copy propagation fix [1] could help. Or possibly the Curro's upcoming work on the scheduler.

[1] https://patchwork.freedesktop.org/patch/127114/

Comment 2 Timothy Arceri 2016-12-29 21:57:53 UTC

Oh I see GFXBench is free to download. Will check it out.

Comment 3 Eero Tamminen 2016-12-30 08:57:44 UTC

(In reply to Timothy Arceri from comment #2)
> Oh I see GFXBench is free to download. Will check it out.

Note: the free version doesn't support automation, so you can just get the shaders, any perf testing unfortunately would need to be manual.

(Once you've verified a set of patches that help with T-Rex shaders, I can do  automated runs to check the perf for the affected benchmarks.)


I went through all the T-Rex GLSL shaders.  They contain a huge amount of dead code and are in the end fairly simple.  Only little control flow, and if one ignores the dead code, they're fairly short.  Several use discard though.

The only shader with loop in it is this motion blur fragment shader (which is known to have clear impact on the test performance):
------------------------
uniform sampler2D texture_unit0;
uniform sampler2D texture_unit1;
varying vec2 out_texcoord0;

void main()
{
        const int n = 4;
        vec3 motion = texture2D(texture_unit1, out_texcoord0).xyz;
        vec4 texel = texture2D(texture_unit0, out_texcoord0);
        vec3 color = texel.xyz / float (n);
        motion.xy = (motion.xy - 0.5) * vec2(0.0666667, .125);
        motion.xy *= texel.a;
        for (int i = 1; i < n; i++) {
                vec2 tc = out_texcoord0 - motion.xy * float (i);
                color += texture2D(texture_unit0, tc).xyz / float (n);
        }
        gl_FragColor.xyz = color;
}
------------------------
(I removed all the dead code, uniforms, varyings etc.)


All the other loops in T-Rex shaders are dead code.

Comment 4 Eero Tamminen 2016-12-30 09:12:37 UTC

(To find the full motion blur shader from your T-Rex dumps, just grep for "motion", it's the only one using a variable by that name.)

Comment 5 Timothy Arceri 2016-12-30 09:27:45 UTC

Created attachment 128692 [details]
TREX loop unrolled in GLSL IR

Comment 6 Timothy Arceri 2016-12-30 09:28:18 UTC

Created attachment 128693 [details]
TREX loop unrolled in NIR

Comment 7 Timothy Arceri 2016-12-30 09:31:57 UTC

I've attached dumps from unrolling the loops with the different passes. It doesn't look like there is any easy solution, both have the same instruction count it looks like the result comes down to scheduling.

Comment 8 Eero Tamminen 2016-12-30 12:59:40 UTC

(In reply to Timothy Arceri from comment #7)
> I've attached dumps from unrolling the loops with the different passes. It
> doesn't look like there is any easy solution, both have the same instruction
> count it looks like the result comes down to scheduling.

NIR version schedules the second last "send" worse, and it has more register bank conflicts with "mad" instructions.

Is it possible that the patch would affect anything else besides shaders with (non-dead) loops?

Comment 9 Eero Tamminen 2016-12-30 14:52:17 UTC

SynMark PSPhong case seems similar (worse scheduling and more register bank conflicts).

In the SynMark PSPom case, fragment shader has two loops of which GLSL IR can unroll the other, but NIR doesn't unroll either of them.

Should NIR be able to unroll loops with non-integer loop counter, like GLSL IR did:
------------------------
for (float i = 0.02; i < 0.9; i += 0.11)
   maxSh += (texture(normalTex, texCoord + lightRay * i).a - sh0) * (1.0 - i) * shadowSoftening;
------------------------
?

Comment 10 Matt Turner 2016-12-30 16:15:32 UTC

(In reply to Eero Tamminen from comment #9)
> SynMark PSPhong case seems similar (worse scheduling and more register bank
> conflicts).
> 
> In the SynMark PSPom case, fragment shader has two loops of which GLSL IR
> can unroll the other, but NIR doesn't unroll either of them.
> 
> Should NIR be able to unroll loops with non-integer loop counter, like GLSL
> IR did:
> ------------------------
> for (float i = 0.02; i < 0.9; i += 0.11)
>    maxSh += (texture(normalTex, texCoord + lightRay * i).a - sh0) * (1.0 -
> i) * shadowSoftening;
> ------------------------
> ?

This is going to get really confusing really quickly is we're trying to debug 5 different regressions in the same bug. I recommend filing new bugs for each on that we're analyzing and making them block a TRACKER bug.

Comment 11 Eero Tamminen 2017-01-02 09:41:42 UTC

(In reply to Matt Turner from comment #10)
> (In reply to Eero Tamminen from comment #9)
> > SynMark PSPhong case seems similar (worse scheduling and more register bank
> > conflicts).
> > 
> > In the SynMark PSPom case, fragment shader has two loops of which GLSL IR
> > can unroll the other, but NIR doesn't unroll either of them.
> > 
> > Should NIR be able to unroll loops with non-integer loop counter, like GLSL
> > IR did:
> > ------------------------
> > for (float i = 0.02; i < 0.9; i += 0.11)
> >    maxSh += (texture(normalTex, texCoord + lightRay * i).a - sh0) * (1.0 -
> > i) * shadowSoftening;
> > ------------------------
> > ?

Few other things that in shader-db had non-integer loop counters with fixed counts:
- Invisible Inc
- Talos Principle
- Steam Big Picture
- Serious Sam 3

Manhattan, Unigine demos (at Ultra level), Nexuiz and Xonotic had float loops without fixed count.


> This is going to get really confusing really quickly is we're trying to
> debug 5 different regressions in the same bug. I recommend filing new bugs
> for each on that we're analyzing and making them block a TRACKER bug.

While there are 4 benchmarks that are known to be regressed, so far it's one issue, and potential issue with float counter loops.

Does/should NIR support unrolling non-integer loop counters?

Comment 12 Timothy Arceri 2017-01-02 09:49:28 UTC

(In reply to Eero Tamminen from comment #11)
> Does/should NIR support unrolling non-integer loop counters?

Yes it should. I'll take a look at the example you gave tomorrow to see what is going wrong.

Comment 13 Eero Tamminen 2017-01-02 14:36:05 UTC

PSBump8 case can be ignored.  While there are some scheduling differences, there are no changes for send, nor change in instruction counts, and the perf change is close to variance Additionally, perf of PSBump2, which is otherwise same test but just has smaller loop count, went into other direction.

Comment 14 Timothy Arceri 2017-01-03 01:32:53 UTC

Created attachment 128716 [details] [review]
Fix loop interation count for floats

Does this help with unrolling the loop?

Comment 15 Eero Tamminen 2017-01-03 14:41:44 UTC

Yes, with that, the float loop is unrolled and PSPom perf regression is fixed.

This:
SIMD16 shader: 156 instructions. 2 loops. 6802 cycles. 0:0 spills:fills. Promoted 4 constants. Compacted 2496 to 1552 bytes (38%)
Changed to:
SIMD16 shader: 197 instructions. 1 loops. 5924 cycles. 0:0 spills:fills. Promoted 17 constants. Compacted 3152 to 2176 bytes (31%)

Comment 16 Timothy Arceri 2017-01-04 03:49:58 UTC

(In reply to Eero Tamminen from comment #15)
> Yes, with that, the float loop is unrolled and PSPom perf regression is
> fixed.
> 

Cool. I've pushed that fix.

commit 4b7dfd881296a542a0c08a12c27f643dabd7280c
Author: Timothy Arceri <timothy.arceri@collabora.com>
Date:   Tue Jan 3 12:03:54 2017 +1100

    nir: fix loop iteration count calculation for floats
    
    Fixes performance regression in SynMark PSPom caused by loops with float
    counters not always unrolling.
    
    For example:
    
       for (float i = 0.02; i < 0.9; i += 0.11)
          ...
    
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>

Comment 17 Eero Tamminen 2017-01-05 09:33:47 UTC

The commit fixes PSPom perf on all platforms where I have data (on GEN6 & GEN7 there was no clear drop on it, nor now increase, changes affected just GEN8 & GEN9).

What you propose should be done about T-Rex & PSPhong:
* Close this bug now, as the clear regression was fixed, or
* Keep this bug open until Curro's scheduling rework lands, and/or somebody adds register bank conflict handling to Mesa
?

Of latter, register bank conflict handling should be clear win (especially on older platforms), but scheduling is all about compromises, so some tests can go down while on average things improve.

Comment 18 Timothy Arceri 2017-01-05 23:07:04 UTC

(In reply to Eero Tamminen from comment #17)
> The commit fixes PSPom perf on all platforms where I have data (on GEN6 &
> GEN7 there was no clear drop on it, nor now increase, changes affected just
> GEN8 & GEN9).
> 
> What you propose should be done about T-Rex & PSPhong:
> * Close this bug now, as the clear regression was fixed, or
> * Keep this bug open until Curro's scheduling rework lands, and/or somebody
> adds register bank conflict handling to Mesa
> ?
> 

It's up to you. I won't be looking into this further since the real regression is fixed and the other dips in performance are just due to things shuffling around due to the different order in which we are calling things.

If you want to wait for the scheduling rework to test again I'm happy for you to leave the bug open.

Comment 19 Kenneth Graunke 2017-01-07 01:12:38 UTC

I don't think it's worth tracking this as a regression if we're mostly interested in adding register banking info and better scheduling...we may as well just track those as future improvements we need to make.

Comment 20 Eero Tamminen 2017-01-09 11:08:36 UTC

CSDof perf drops are now fixed too, on several platforms Mesa is now faster than before (Jason: "i965/compiler: Use the new nir_opt_copy_prop_vars pass").

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.