Bug 109216 - 4-27% performance drop in Vulkan benchmarks
Summary: 4-27% performance drop in Vulkan benchmarks
Status: VERIFIED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/intel (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: Ian Romanick
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords: bisected, regression
Depends on:
Blocks: mesa-19.0 109582
  Show dependency treegraph
 
Reported: 2019-01-03 17:09 UTC by Eero Tamminen
Modified: 2019-04-15 17:27 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Avoid Vsync / Mailbox mode in fullscreen (1.00 KB, patch)
2019-01-03 17:09 UTC, Eero Tamminen
Details | Splinter Review
Test case that reproduces the loop unrolling problem in GLSL. (630 bytes, text/plain)
2019-01-09 20:58 UTC, Ian Romanick
Details
intel/compiler: Try nir_opt_if once before nir_opt_peephole_select (2.82 KB, patch)
2019-01-09 23:52 UTC, Ian Romanick
Details | Splinter Review

Description Eero Tamminen 2019-01-03 17:09:53 UTC
Created attachment 142959 [details] [review]
Avoid Vsync / Mailbox mode in fullscreen

Setup:
* Ubuntu 18.04 / Unity desktop
* Mesa git version
* X server & drm-tip kernel built from git within last couple of months
* Modifiers enabled in X server configuration ("dmabuf_capable")


Vulkan performance noticeably dropped on all Vulkan supporting platforms (at least from HSW to CFL) between following Mesa commits:
2018-12-17 17:52:23 00e2cbc049 v3d: Fix the argument type for vir_BRANCH()
2018-12-18 18:47:54 29e4b949b4 Revert "nir/lower_indirect: Bail early if modes == 0"

E.g. on KBL 7500U (GT2), performance drops were following:
- 27% Deferred Multisampling (*)
- 9% Compute NBody (*)
- 6% Compute Raytracing (*)
- 4% GfxBench Vulkan Aztec Ruins normal

(*) Sacha Willems' Vulkan demo, run at FullHD fullscreen, with Vsync disabled i.e. NOT using mailbox mode as that seems to get Vsynched although Vsync is supposed to be disabled (see attachment). For example:
 ./deferredmultisampling --fullscreen --benchmark --benchwarmup 3 --benchruntime 20

E.g. in Deferred Multisampling case; GPU, CPU and memory power usage drops clearly with the performance, so I assume it's become more synchronous.

Performance didn't drop in following of the Vulkan demos:
* Triangle (so it's unlikely to be related to modifiers)
* Multithreading
* Indirect Draw

(Don't have data for the rest of the Vulkan demos.)
Comment 1 Lionel Landwerlin 2019-01-07 13:56:38 UTC
Ian: I wonder if that could be a result of this series : https://gitlab.freedesktop.org/mesa/mesa/merge_requests/9
Comment 2 Eero Tamminen 2019-01-07 15:25:51 UTC
(In reply to Lionel Landwerlin from comment #1)
> Ian: I wonder if that could be a result of this series :
> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/9

This series doesn't seem Vulkan specific.

There was no drop in GL or GLES version of Aztec Ruins, only in Vulkan version of AztecRuins (*).  And no drop in other GL benchmarks, so I think the regression is Vulkan specific.

(*) While Vulkan version runs windowed (because GfxBench doesn't seem to support fullscreen for Vulkan), and GL & GLES versions are fullscreen, I don't think that test being windowed is relevant because:
- the other regressing Vulkan tests are fullscreen
- AztecRuins has low FPS

Were you able to reproduce e.g. Deferred Multisampling Vulkan demo drop?
Comment 3 Lionel Landwerlin 2019-01-07 15:43:29 UTC
(In reply to Eero Tamminen from comment #2)
> (In reply to Lionel Landwerlin from comment #1)
> > Ian: I wonder if that could be a result of this series :
> > https://gitlab.freedesktop.org/mesa/mesa/merge_requests/9
> 
> This series doesn't seem Vulkan specific.
> 
> There was no drop in GL or GLES version of Aztec Ruins, only in Vulkan
> version of AztecRuins (*).  And no drop in other GL benchmarks, so I think
> the regression is Vulkan specific.
> 
> (*) While Vulkan version runs windowed (because GfxBench doesn't seem to
> support fullscreen for Vulkan), and GL & GLES versions are fullscreen, I
> don't think that test being windowed is relevant because:
> - the other regressing Vulkan tests are fullscreen
> - AztecRuins has low FPS
> 
> Were you able to reproduce e.g. Deferred Multisampling Vulkan demo drop?

Yep, bisecting now just to verify.
Comment 4 Lionel Landwerlin 2019-01-07 15:55:38 UTC
Bisected to :

commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
Author: Ian Romanick <ian.d.romanick@intel.com>
Date:   Tue May 22 18:56:41 2018 -0700

    intel/compiler: More peephole select
    
    Shader-db results:
    
    The one shader hurt for instructions is a compute shader that had both
    spills and fills hurt.
Comment 5 Eero Tamminen 2019-01-07 16:38:30 UTC
(In reply to Lionel Landwerlin from comment #4)
> commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
> Author: Ian Romanick <ian.d.romanick@intel.com>
> Date:   Tue May 22 18:56:41 2018 -0700
> 
>     intel/compiler: More peephole select
>     
>     Shader-db results:
>     
>     The one shader hurt for instructions is a compute shader that had both
>     spills and fills hurt.

I think shaders in Sacha Willems' Vulkan demos would also need to be added to shader-db.  According to INTEL_DEBUG output, Deferred Multisampling demo doesn't have any compute shaders or spills/fills, and it regressed most.
Comment 6 Lionel Landwerlin 2019-01-07 16:44:15 UTC
(In reply to Lionel Landwerlin from comment #4)
> Bisected to :
> 
> commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
> Author: Ian Romanick <ian.d.romanick@intel.com>
> Date:   Tue May 22 18:56:41 2018 -0700
> 
>     intel/compiler: More peephole select
>     
>     Shader-db results:
>     
>     The one shader hurt for instructions is a compute shader that had both
>     spills and fills hurt.

This significantly impacts one a couple of the shaders of the multisampling demo :

First one :
Before: SIMD16 shader: 356 instructions. 0 loops. 1812 cycles. 0:0 spills:fills. Promoted 2 constants. Compacted 5696 to 3488 bytes (39%)
After: SIMD16 shader: 157 instructions. 3 loops. 99970 cycles. 0:0 spills:fills. Promoted 2 constants. Compacted 2512 to 1536 bytes (39%)

Second one:
Before: SIMD16 shader: 249 instructions. 0 loops. 1170 cycles. 0:0 spills:fills. Promoted 0 constants. Compacted 3984 to 2560 bytes (36%)
After: SIMD16 shader: 102 instructions. 3 loops. 4580 cycles. 0:0 spills:fills. Promoted 0 constants. Compacted 1632 to 1088 bytes (33%)

This is one of the shader : https://github.com/SaschaWillems/Vulkan/blob/master/data/shaders/deferredmultisampling/deferred.frag
Comment 7 Lionel Landwerlin 2019-01-07 16:44:37 UTC
(In reply to Eero Tamminen from comment #5)
> (In reply to Lionel Landwerlin from comment #4)
> > commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
> > Author: Ian Romanick <ian.d.romanick@intel.com>
> > Date:   Tue May 22 18:56:41 2018 -0700
> > 
> >     intel/compiler: More peephole select
> >     
> >     Shader-db results:
> >     
> >     The one shader hurt for instructions is a compute shader that had both
> >     spills and fills hurt.
> 
> I think shaders in Sacha Willems' Vulkan demos would also need to be added
> to shader-db.  According to INTEL_DEBUG output, Deferred Multisampling demo
> doesn't have any compute shaders or spills/fills, and it regressed most.

Totally, about to do that.
Comment 8 Lionel Landwerlin 2019-01-07 17:59:18 UTC
(In reply to Lionel Landwerlin from comment #7)
> (In reply to Eero Tamminen from comment #5)
> > (In reply to Lionel Landwerlin from comment #4)
> > > commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
> > > Author: Ian Romanick <ian.d.romanick@intel.com>
> > > Date:   Tue May 22 18:56:41 2018 -0700
> > > 
> > >     intel/compiler: More peephole select
> > >     
> > >     Shader-db results:
> > >     
> > >     The one shader hurt for instructions is a compute shader that had both
> > >     spills and fills hurt.
> > 
> > I think shaders in Sacha Willems' Vulkan demos would also need to be added
> > to shader-db.  According to INTEL_DEBUG output, Deferred Multisampling demo
> > doesn't have any compute shaders or spills/fills, and it regressed most.
> 
> Totally, about to do that.

Branch here : https://gitlab.freedesktop.org/llandwerlin/shader-db/commits/master 

Need to fix the stub later.
Comment 9 Ian Romanick 2019-01-09 20:56:19 UTC
(In reply to Eero Tamminen from comment #5)
> (In reply to Lionel Landwerlin from comment #4)
> > commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
> > Author: Ian Romanick <ian.d.romanick@intel.com>
> > Date:   Tue May 22 18:56:41 2018 -0700
> > 
> >     intel/compiler: More peephole select
> >     
> >     Shader-db results:
> >     
> >     The one shader hurt for instructions is a compute shader that had both
> >     spills and fills hurt.
> 
> I think shaders in Sacha Willems' Vulkan demos would also need to be added
> to shader-db.  According to INTEL_DEBUG output, Deferred Multisampling demo
> doesn't have any compute shaders or spills/fills, and it regressed most.

Adding the GLSL shader won't actually help this problem.  We'd have to add the SPIR-V instead.  The issue is the weird way that glslang generates for-loops.  I have a test case that reproduces the issue in plain GLSL.
Comment 10 Ian Romanick 2019-01-09 20:58:09 UTC
Created attachment 143043 [details]
Test case that reproduces the loop unrolling problem in GLSL.
Comment 11 Ian Romanick 2019-01-09 21:01:15 UTC
There was also some discussion about the underlying issue in bug #109081.
Comment 12 Lionel Landwerlin 2019-01-09 21:23:40 UTC
There is also https://github.com/airlied/pipeline-db to capture.
Comment 13 Ian Romanick 2019-01-09 21:41:33 UTC
For my own notes: The optimization that needs to be modified to also handle bcsel is opt_peel_loop_initial_if in nir_opt_if.c.
Comment 14 Ian Romanick 2019-01-09 23:52:21 UTC
Created attachment 143048 [details] [review]
intel/compiler: Try nir_opt_if once before nir_opt_peephole_select

Modifying opt_peel_loop_initial_if to work on bcsel in an manner similar to how it operates on if-statements (i.e., moving all of the instructions that are only used to support the sources of the bcsel to the end of the loop) is going to be non-trivial.  In the mean time, this fixes the problem in the small test case that I supplied.  As soon as it's done running through the CI, I will submit an MR.
Comment 15 Eero Tamminen 2019-01-11 15:43:37 UTC
(In reply to Ian Romanick from comment #14)
> Created attachment 143048 [details] [review] [review]
> intel/compiler: Try nir_opt_if once before nir_opt_peephole_select

This patch fixes the smaller perf regressions in (nbody & raytracing) compute tests, but it doesn't help at all with the largest regression, in Deferred Multisampling.

(Aztec Ruins I couldn't test due to bug 109304.)
Comment 16 Ian Romanick 2019-01-11 18:41:16 UTC
(In reply to Eero Tamminen from comment #15)
> (In reply to Ian Romanick from comment #14)
> > Created attachment 143048 [details] [review] [review] [review]
> > intel/compiler: Try nir_opt_if once before nir_opt_peephole_select
> 
> This patch fixes the smaller perf regressions in (nbody & raytracing)
> compute tests, but it doesn't help at all with the largest regression, in
> Deferred Multisampling.

It appears that the inner loop still isn't getting unrolled because of the problem this patch should have resolved.  I'll keep digging.
Comment 17 Ian Romanick 2019-01-12 02:16:23 UTC
I put up a WIP: merge request.  This series fixes the issue in the Sascha Willems demos for me.  I get the same number of loops with this series as there was before 8fb8ebfbb05d.

There's a one instruction regression in the deferredmultisampling demo compared to 8fb8ebfbb05d~1, but I can live with that.

https://gitlab.freedesktop.org/mesa/mesa/merge_requests/107

I can't get a successful CI run even with Mesa master right now, so this is definitely not fully tested.
Comment 18 Eero Tamminen 2019-01-14 08:44:17 UTC
(In reply to Ian Romanick from comment #17)
> I put up a WIP: merge request.  This series fixes the issue in the Sascha
> Willems demos for me.  I get the same number of loops with this series as
> there was before 8fb8ebfbb05d.
> 
> There's a one instruction regression in the deferredmultisampling demo
> compared to 8fb8ebfbb05d~1, but I can live with that.
> 
> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/107
> 
> I can't get a successful CI run even with Mesa master right now, so this is
> definitely not fully tested.

Your branch includes bug 109304 so it won't work with Aztec Ruins (which also regressed).

If you can provide patch series I can apply on top of latest Mesa, I could do full test with it.
Comment 19 Eero Tamminen 2019-01-14 11:00:11 UTC
Sorry, I somehow missed the download link (I haven't really looked at gitlab MRs before):
https://gitlab.freedesktop.org/mesa/mesa/merge_requests/107.patch

Performance-wise your changes mostly fix the regressing test-cases.

There are still following gaps:
* Deferred Multisampling: 0.5% (down from 27%)
* Compute N-Body: 3% (down from 9%)

Aztec Ruins and Compute Raytracing gaps seem to have been fixed to within daily variation.

While I think this is fairly acceptable, could you check also Compute N-Body, why it still has a gap?
Comment 20 Ian Romanick 2019-01-14 18:00:58 UTC
(In reply to Eero Tamminen from comment #19)
> Sorry, I somehow missed the download link (I haven't really looked at gitlab
> MRs before):
> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/107.patch
> 
> Performance-wise your changes mostly fix the regressing test-cases.
> 
> There are still following gaps:
> * Deferred Multisampling: 0.5% (down from 27%)
> * Compute N-Body: 3% (down from 9%)

I was not able to reproduce this on my system.  With system Mesa and the branch from the MR I get 34-35 fps.  I looked at the assembly generated before 8fb8ebfbb05d and after this series, and, aside from slightly different register assignment, there's no difference.
Comment 21 Eero Tamminen 2019-01-16 11:26:28 UTC
N-Body test has larger daily variance than the other tests.  Looking at that for last ~3 month, it can be up to 3% i.e. the gap I saw.  As you don't see any gap in this test, I guess I just got a kernel boot where allocation alignments happened so that perf was on the lower side of the variance.

I.e. from my side, everything is good.  I'll recheck the perf once the fix is in Mesa.
Comment 22 Ian Romanick 2019-01-29 20:10:43 UTC
The previous merge request had a few problems that caused a bunch of failures in Vulkan CTS.  I have submitted a new MR that represents a slightly different approach.  This seems to restore the performance of the Sascha Willems demos and passes Vulkan CTS.

https://gitlab.freedesktop.org/mesa/mesa/merge_requests/170
Comment 23 Jason Ekstrand 2019-02-07 21:28:06 UTC
I'm beginning to think we need to just revert the offending patch on the 19.0 branch.  There are solutions to the problem but they involve either reworking NIR loops to have explicit continue blocks or very tricky optimizations.  Neither of those (especially not the former) is something that I'd be super-excited to back-port.
Comment 24 Ian Romanick 2019-02-08 00:26:28 UTC
(In reply to Jason Ekstrand from comment #23)
> I'm beginning to think we need to just revert the offending patch on the
> 19.0 branch.  There are solutions to the problem but they involve either
> reworking NIR loops to have explicit continue blocks or very tricky
> optimizations.  Neither of those (especially not the former) is something
> that I'd be super-excited to back-port.

There is a mostly reviewed MR that I just need to finish.  I'm really, sorry but since I've been back from FOSDEM, I've been very ill.  If this was such a pressing issue for you, you could have also reviewed that patch series when I sent it out and tagged you on it. :(
Comment 25 Jason Ekstrand 2019-02-08 19:37:36 UTC
(In reply to Ian Romanick from comment #24)
> There is a mostly reviewed MR that I just need to finish.  I'm really, sorry
> but since I've been back from FOSDEM, I've been very ill.  If this was such
> a pressing issue for you, you could have also reviewed that patch series
> when I sent it out and tagged you on it. :(

I'm sorry if my comments have been a bit curt and I haven't expressed my meaning fully.  The difficulty for me isn't just the urgency (Vulkan basically has no loop unrolling right now) but that I'm not convinced the currently proposed fixes add up to a complete fix.  What we had before worked and it did so very reliably as long as the loop contained no non-trivial continues.  With 8fb8ebfbb05d33, loops now frequently end up with a different form that makes unrolling more difficult.  Because of the nature of the change to the IR that can result from peephole select eating the continue block if statement, I'm not actually convinced that the sum total of the proposed fixes is enough to fix it in general.

The proper fix for this is probably to do something smarter with continues in general.  I'm not denying the utility of 8fb8ebfbb05d33; any patch which gets > 0.1% boost in shader-db is pretty impressive.  I'm just claiming that there are clearly problems, things aren't baked well enough, and we may need to step back and try again.

As a side-note, it's very clear that not having SPIR-V support in shader-db is a serious problem.  There is vkpipeline-db but I'm not sure if that's really what we want as it records all the pipelines and doesn't do any de-duplication.
Comment 26 Eero Tamminen 2019-02-11 17:31:25 UTC
Ian's patches were merged during weekend, and fix all the regressions indicated here.
Comment 27 Ian Romanick 2019-02-11 18:51:04 UTC
If we decide to revert the original patches from the 19.0 branch, I won't fight it.  If that is ultimately the decision, the following commits will all need to be reverted, in this order:

af07141b33d0a58ed2cfe915b95f146481a4ffef
378f9967710e9145f2a4f8eee89d87badbe0e6ea
8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
Comment 28 Timothy Arceri 2019-02-11 22:04:55 UTC
(In reply to Jason Ekstrand from comment #25)

> As a side-note, it's very clear that not having SPIR-V support in shader-db
> is a serious problem.  There is vkpipeline-db but I'm not sure if that's
> really what we want as it records all the pipelines and doesn't do any
> de-duplication.

Shouldn't it be possible to write a tool and add it to the vkpipeline-db project that compares shaders and removes duplicates?


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.