Patch series culminating in this commit changed performance in several tests: ------------------------------------------------------- commit 40e9f2f13847ddd94e1216088aa00456d7b02d2b Author: Timothy Arceri <timothy.arceri@collabora.com> AuthorDate: Tue Dec 13 11:37:25 2016 +1100 Commit: Timothy Arceri <timothy.arceri@collabora.com> CommitDate: Fri Dec 23 10:15:36 2016 +1100 i965: disable loop unrolling in GLSL IR There is a single regression in loop unrolling which is: loops HURT: shaders/orbital_explorer.shader_test GS SIMD8: 0 -> 1 However the loop is huge so it seems reasonable not to unroll it. It's surprising that GLSL IR does unroll it. ------------------------------------------------------- On SKL i5-6600K (GT2), the changes were following (in FullHD size): Performance dropped due to "disable loop unrolling in GLSL IR": - 2.7% SynMark PSPom - 2.3% SynMark PSPhong - 2.2% GfxBench T-Rex (GL version) - 0.5% SynMark PSBump8 Performance increased due to "use nir loop unrolling pass": + 12.5% SynMark ShMapPcf + 3.9% SynMark CSDof (+8.4% from "use nir_lower_indirect_derefs() for GLSL") + 2.2% SynMark DevRes (composite test including other affected tests) + 0.5% Unigine Valley + 0.5% SynMark PSBump2 Results are similar on other platforms, except for CSDof where results depend a lot on the HW. GfxBench is v4.0, SynMark v7.0. CSDof performance increased a lot on SKL GT2, KBL GT2 and a bit BYT, but its perf dropped a lot SKL GT3e & GT4e, BDW GT3 & GT2, BSW, and a bit on HSW GT3e and BXT. After these 2 changes, SynMark shader compilation speed test is ~25% faster on all platforms which is pretty good.
Unfortunately I don't have access to these benchmarks so its a little hard to know what caused the regressions. It's possible Ken's copy propagation fix [1] could help. Or possibly the Curro's upcoming work on the scheduler. [1] https://patchwork.freedesktop.org/patch/127114/
Oh I see GFXBench is free to download. Will check it out.
(In reply to Timothy Arceri from comment #2) > Oh I see GFXBench is free to download. Will check it out. Note: the free version doesn't support automation, so you can just get the shaders, any perf testing unfortunately would need to be manual. (Once you've verified a set of patches that help with T-Rex shaders, I can do automated runs to check the perf for the affected benchmarks.) I went through all the T-Rex GLSL shaders. They contain a huge amount of dead code and are in the end fairly simple. Only little control flow, and if one ignores the dead code, they're fairly short. Several use discard though. The only shader with loop in it is this motion blur fragment shader (which is known to have clear impact on the test performance): ------------------------ uniform sampler2D texture_unit0; uniform sampler2D texture_unit1; varying vec2 out_texcoord0; void main() { const int n = 4; vec3 motion = texture2D(texture_unit1, out_texcoord0).xyz; vec4 texel = texture2D(texture_unit0, out_texcoord0); vec3 color = texel.xyz / float (n); motion.xy = (motion.xy - 0.5) * vec2(0.0666667, .125); motion.xy *= texel.a; for (int i = 1; i < n; i++) { vec2 tc = out_texcoord0 - motion.xy * float (i); color += texture2D(texture_unit0, tc).xyz / float (n); } gl_FragColor.xyz = color; } ------------------------ (I removed all the dead code, uniforms, varyings etc.) All the other loops in T-Rex shaders are dead code.
(To find the full motion blur shader from your T-Rex dumps, just grep for "motion", it's the only one using a variable by that name.)
Created attachment 128692 [details] TREX loop unrolled in GLSL IR
Created attachment 128693 [details] TREX loop unrolled in NIR
I've attached dumps from unrolling the loops with the different passes. It doesn't look like there is any easy solution, both have the same instruction count it looks like the result comes down to scheduling.
(In reply to Timothy Arceri from comment #7) > I've attached dumps from unrolling the loops with the different passes. It > doesn't look like there is any easy solution, both have the same instruction > count it looks like the result comes down to scheduling. NIR version schedules the second last "send" worse, and it has more register bank conflicts with "mad" instructions. Is it possible that the patch would affect anything else besides shaders with (non-dead) loops?
SynMark PSPhong case seems similar (worse scheduling and more register bank conflicts). In the SynMark PSPom case, fragment shader has two loops of which GLSL IR can unroll the other, but NIR doesn't unroll either of them. Should NIR be able to unroll loops with non-integer loop counter, like GLSL IR did: ------------------------ for (float i = 0.02; i < 0.9; i += 0.11) maxSh += (texture(normalTex, texCoord + lightRay * i).a - sh0) * (1.0 - i) * shadowSoftening; ------------------------ ?
(In reply to Eero Tamminen from comment #9) > SynMark PSPhong case seems similar (worse scheduling and more register bank > conflicts). > > In the SynMark PSPom case, fragment shader has two loops of which GLSL IR > can unroll the other, but NIR doesn't unroll either of them. > > Should NIR be able to unroll loops with non-integer loop counter, like GLSL > IR did: > ------------------------ > for (float i = 0.02; i < 0.9; i += 0.11) > maxSh += (texture(normalTex, texCoord + lightRay * i).a - sh0) * (1.0 - > i) * shadowSoftening; > ------------------------ > ? This is going to get really confusing really quickly is we're trying to debug 5 different regressions in the same bug. I recommend filing new bugs for each on that we're analyzing and making them block a TRACKER bug.
(In reply to Matt Turner from comment #10) > (In reply to Eero Tamminen from comment #9) > > SynMark PSPhong case seems similar (worse scheduling and more register bank > > conflicts). > > > > In the SynMark PSPom case, fragment shader has two loops of which GLSL IR > > can unroll the other, but NIR doesn't unroll either of them. > > > > Should NIR be able to unroll loops with non-integer loop counter, like GLSL > > IR did: > > ------------------------ > > for (float i = 0.02; i < 0.9; i += 0.11) > > maxSh += (texture(normalTex, texCoord + lightRay * i).a - sh0) * (1.0 - > > i) * shadowSoftening; > > ------------------------ > > ? Few other things that in shader-db had non-integer loop counters with fixed counts: - Invisible Inc - Talos Principle - Steam Big Picture - Serious Sam 3 Manhattan, Unigine demos (at Ultra level), Nexuiz and Xonotic had float loops without fixed count. > This is going to get really confusing really quickly is we're trying to > debug 5 different regressions in the same bug. I recommend filing new bugs > for each on that we're analyzing and making them block a TRACKER bug. While there are 4 benchmarks that are known to be regressed, so far it's one issue, and potential issue with float counter loops. Does/should NIR support unrolling non-integer loop counters?
(In reply to Eero Tamminen from comment #11) > Does/should NIR support unrolling non-integer loop counters? Yes it should. I'll take a look at the example you gave tomorrow to see what is going wrong.
PSBump8 case can be ignored. While there are some scheduling differences, there are no changes for send, nor change in instruction counts, and the perf change is close to variance Additionally, perf of PSBump2, which is otherwise same test but just has smaller loop count, went into other direction.
Created attachment 128716 [details] [review] Fix loop interation count for floats Does this help with unrolling the loop?
Yes, with that, the float loop is unrolled and PSPom perf regression is fixed. This: SIMD16 shader: 156 instructions. 2 loops. 6802 cycles. 0:0 spills:fills. Promoted 4 constants. Compacted 2496 to 1552 bytes (38%) Changed to: SIMD16 shader: 197 instructions. 1 loops. 5924 cycles. 0:0 spills:fills. Promoted 17 constants. Compacted 3152 to 2176 bytes (31%)
(In reply to Eero Tamminen from comment #15) > Yes, with that, the float loop is unrolled and PSPom perf regression is > fixed. > Cool. I've pushed that fix. commit 4b7dfd881296a542a0c08a12c27f643dabd7280c Author: Timothy Arceri <timothy.arceri@collabora.com> Date: Tue Jan 3 12:03:54 2017 +1100 nir: fix loop iteration count calculation for floats Fixes performance regression in SynMark PSPom caused by loops with float counters not always unrolling. For example: for (float i = 0.02; i < 0.9; i += 0.11) ... Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
The commit fixes PSPom perf on all platforms where I have data (on GEN6 & GEN7 there was no clear drop on it, nor now increase, changes affected just GEN8 & GEN9). What you propose should be done about T-Rex & PSPhong: * Close this bug now, as the clear regression was fixed, or * Keep this bug open until Curro's scheduling rework lands, and/or somebody adds register bank conflict handling to Mesa ? Of latter, register bank conflict handling should be clear win (especially on older platforms), but scheduling is all about compromises, so some tests can go down while on average things improve.
(In reply to Eero Tamminen from comment #17) > The commit fixes PSPom perf on all platforms where I have data (on GEN6 & > GEN7 there was no clear drop on it, nor now increase, changes affected just > GEN8 & GEN9). > > What you propose should be done about T-Rex & PSPhong: > * Close this bug now, as the clear regression was fixed, or > * Keep this bug open until Curro's scheduling rework lands, and/or somebody > adds register bank conflict handling to Mesa > ? > It's up to you. I won't be looking into this further since the real regression is fixed and the other dips in performance are just due to things shuffling around due to the different order in which we are calling things. If you want to wait for the scheduling rework to test again I'm happy for you to leave the bug open.
I don't think it's worth tracking this as a regression if we're mostly interested in adding register banking info and better scheduling...we may as well just track those as future improvements we need to make.
CSDof perf drops are now fixed too, on several platforms Mesa is now faster than before (Jason: "i965/compiler: Use the new nir_opt_copy_prop_vars pass").
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.