Bug 109216 - 4-27% performance drop in Vulkan benchmarks
Summary: 4-27% performance drop in Vulkan benchmarks
Status: ASSIGNED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/intel (show other bugs)
Version: unspecified
Hardware: Other All
: medium normal
Assignee: Ian Romanick
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-03 17:09 UTC by Eero Tamminen
Modified: 2019-01-16 11:26 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Avoid Vsync / Mailbox mode in fullscreen (1.00 KB, patch)
2019-01-03 17:09 UTC, Eero Tamminen
Details | Splinter Review
Test case that reproduces the loop unrolling problem in GLSL. (630 bytes, text/plain)
2019-01-09 20:58 UTC, Ian Romanick
Details
intel/compiler: Try nir_opt_if once before nir_opt_peephole_select (2.82 KB, patch)
2019-01-09 23:52 UTC, Ian Romanick
Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Eero Tamminen 2019-01-03 17:09:53 UTC
Created attachment 142959 [details] [review]
Avoid Vsync / Mailbox mode in fullscreen

Setup:
* Ubuntu 18.04 / Unity desktop
* Mesa git version
* X server & drm-tip kernel built from git within last couple of months
* Modifiers enabled in X server configuration ("dmabuf_capable")


Vulkan performance noticeably dropped on all Vulkan supporting platforms (at least from HSW to CFL) between following Mesa commits:
2018-12-17 17:52:23 00e2cbc049 v3d: Fix the argument type for vir_BRANCH()
2018-12-18 18:47:54 29e4b949b4 Revert "nir/lower_indirect: Bail early if modes == 0"

E.g. on KBL 7500U (GT2), performance drops were following:
- 27% Deferred Multisampling (*)
- 9% Compute NBody (*)
- 6% Compute Raytracing (*)
- 4% GfxBench Vulkan Aztec Ruins normal

(*) Sacha Willems' Vulkan demo, run at FullHD fullscreen, with Vsync disabled i.e. NOT using mailbox mode as that seems to get Vsynched although Vsync is supposed to be disabled (see attachment). For example:
 ./deferredmultisampling --fullscreen --benchmark --benchwarmup 3 --benchruntime 20

E.g. in Deferred Multisampling case; GPU, CPU and memory power usage drops clearly with the performance, so I assume it's become more synchronous.

Performance didn't drop in following of the Vulkan demos:
* Triangle (so it's unlikely to be related to modifiers)
* Multithreading
* Indirect Draw

(Don't have data for the rest of the Vulkan demos.)
Comment 1 Lionel Landwerlin 2019-01-07 13:56:38 UTC
Ian: I wonder if that could be a result of this series : https://gitlab.freedesktop.org/mesa/mesa/merge_requests/9
Comment 2 Eero Tamminen 2019-01-07 15:25:51 UTC
(In reply to Lionel Landwerlin from comment #1)
> Ian: I wonder if that could be a result of this series :
> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/9

This series doesn't seem Vulkan specific.

There was no drop in GL or GLES version of Aztec Ruins, only in Vulkan version of AztecRuins (*).  And no drop in other GL benchmarks, so I think the regression is Vulkan specific.

(*) While Vulkan version runs windowed (because GfxBench doesn't seem to support fullscreen for Vulkan), and GL & GLES versions are fullscreen, I don't think that test being windowed is relevant because:
- the other regressing Vulkan tests are fullscreen
- AztecRuins has low FPS

Were you able to reproduce e.g. Deferred Multisampling Vulkan demo drop?
Comment 3 Lionel Landwerlin 2019-01-07 15:43:29 UTC
(In reply to Eero Tamminen from comment #2)
> (In reply to Lionel Landwerlin from comment #1)
> > Ian: I wonder if that could be a result of this series :
> > https://gitlab.freedesktop.org/mesa/mesa/merge_requests/9
> 
> This series doesn't seem Vulkan specific.
> 
> There was no drop in GL or GLES version of Aztec Ruins, only in Vulkan
> version of AztecRuins (*).  And no drop in other GL benchmarks, so I think
> the regression is Vulkan specific.
> 
> (*) While Vulkan version runs windowed (because GfxBench doesn't seem to
> support fullscreen for Vulkan), and GL & GLES versions are fullscreen, I
> don't think that test being windowed is relevant because:
> - the other regressing Vulkan tests are fullscreen
> - AztecRuins has low FPS
> 
> Were you able to reproduce e.g. Deferred Multisampling Vulkan demo drop?

Yep, bisecting now just to verify.
Comment 4 Lionel Landwerlin 2019-01-07 15:55:38 UTC
Bisected to :

commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
Author: Ian Romanick <ian.d.romanick@intel.com>
Date:   Tue May 22 18:56:41 2018 -0700

    intel/compiler: More peephole select
    
    Shader-db results:
    
    The one shader hurt for instructions is a compute shader that had both
    spills and fills hurt.
Comment 5 Eero Tamminen 2019-01-07 16:38:30 UTC
(In reply to Lionel Landwerlin from comment #4)
> commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
> Author: Ian Romanick <ian.d.romanick@intel.com>
> Date:   Tue May 22 18:56:41 2018 -0700
> 
>     intel/compiler: More peephole select
>     
>     Shader-db results:
>     
>     The one shader hurt for instructions is a compute shader that had both
>     spills and fills hurt.

I think shaders in Sacha Willems' Vulkan demos would also need to be added to shader-db.  According to INTEL_DEBUG output, Deferred Multisampling demo doesn't have any compute shaders or spills/fills, and it regressed most.
Comment 6 Lionel Landwerlin 2019-01-07 16:44:15 UTC
(In reply to Lionel Landwerlin from comment #4)
> Bisected to :
> 
> commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
> Author: Ian Romanick <ian.d.romanick@intel.com>
> Date:   Tue May 22 18:56:41 2018 -0700
> 
>     intel/compiler: More peephole select
>     
>     Shader-db results:
>     
>     The one shader hurt for instructions is a compute shader that had both
>     spills and fills hurt.

This significantly impacts one a couple of the shaders of the multisampling demo :

First one :
Before: SIMD16 shader: 356 instructions. 0 loops. 1812 cycles. 0:0 spills:fills. Promoted 2 constants. Compacted 5696 to 3488 bytes (39%)
After: SIMD16 shader: 157 instructions. 3 loops. 99970 cycles. 0:0 spills:fills. Promoted 2 constants. Compacted 2512 to 1536 bytes (39%)

Second one:
Before: SIMD16 shader: 249 instructions. 0 loops. 1170 cycles. 0:0 spills:fills. Promoted 0 constants. Compacted 3984 to 2560 bytes (36%)
After: SIMD16 shader: 102 instructions. 3 loops. 4580 cycles. 0:0 spills:fills. Promoted 0 constants. Compacted 1632 to 1088 bytes (33%)

This is one of the shader : https://github.com/SaschaWillems/Vulkan/blob/master/data/shaders/deferredmultisampling/deferred.frag
Comment 7 Lionel Landwerlin 2019-01-07 16:44:37 UTC
(In reply to Eero Tamminen from comment #5)
> (In reply to Lionel Landwerlin from comment #4)
> > commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
> > Author: Ian Romanick <ian.d.romanick@intel.com>
> > Date:   Tue May 22 18:56:41 2018 -0700
> > 
> >     intel/compiler: More peephole select
> >     
> >     Shader-db results:
> >     
> >     The one shader hurt for instructions is a compute shader that had both
> >     spills and fills hurt.
> 
> I think shaders in Sacha Willems' Vulkan demos would also need to be added
> to shader-db.  According to INTEL_DEBUG output, Deferred Multisampling demo
> doesn't have any compute shaders or spills/fills, and it regressed most.

Totally, about to do that.
Comment 8 Lionel Landwerlin 2019-01-07 17:59:18 UTC
(In reply to Lionel Landwerlin from comment #7)
> (In reply to Eero Tamminen from comment #5)
> > (In reply to Lionel Landwerlin from comment #4)
> > > commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
> > > Author: Ian Romanick <ian.d.romanick@intel.com>
> > > Date:   Tue May 22 18:56:41 2018 -0700
> > > 
> > >     intel/compiler: More peephole select
> > >     
> > >     Shader-db results:
> > >     
> > >     The one shader hurt for instructions is a compute shader that had both
> > >     spills and fills hurt.
> > 
> > I think shaders in Sacha Willems' Vulkan demos would also need to be added
> > to shader-db.  According to INTEL_DEBUG output, Deferred Multisampling demo
> > doesn't have any compute shaders or spills/fills, and it regressed most.
> 
> Totally, about to do that.

Branch here : https://gitlab.freedesktop.org/llandwerlin/shader-db/commits/master 

Need to fix the stub later.
Comment 9 Ian Romanick 2019-01-09 20:56:19 UTC
(In reply to Eero Tamminen from comment #5)
> (In reply to Lionel Landwerlin from comment #4)
> > commit 8fb8ebfbb05d3323481c8ba6d320b3a3580bad99
> > Author: Ian Romanick <ian.d.romanick@intel.com>
> > Date:   Tue May 22 18:56:41 2018 -0700
> > 
> >     intel/compiler: More peephole select
> >     
> >     Shader-db results:
> >     
> >     The one shader hurt for instructions is a compute shader that had both
> >     spills and fills hurt.
> 
> I think shaders in Sacha Willems' Vulkan demos would also need to be added
> to shader-db.  According to INTEL_DEBUG output, Deferred Multisampling demo
> doesn't have any compute shaders or spills/fills, and it regressed most.

Adding the GLSL shader won't actually help this problem.  We'd have to add the SPIR-V instead.  The issue is the weird way that glslang generates for-loops.  I have a test case that reproduces the issue in plain GLSL.
Comment 10 Ian Romanick 2019-01-09 20:58:09 UTC
Created attachment 143043 [details]
Test case that reproduces the loop unrolling problem in GLSL.
Comment 11 Ian Romanick 2019-01-09 21:01:15 UTC
There was also some discussion about the underlying issue in bug #109081.
Comment 12 Lionel Landwerlin 2019-01-09 21:23:40 UTC
There is also https://github.com/airlied/pipeline-db to capture.
Comment 13 Ian Romanick 2019-01-09 21:41:33 UTC
For my own notes: The optimization that needs to be modified to also handle bcsel is opt_peel_loop_initial_if in nir_opt_if.c.
Comment 14 Ian Romanick 2019-01-09 23:52:21 UTC
Created attachment 143048 [details] [review]
intel/compiler: Try nir_opt_if once before nir_opt_peephole_select

Modifying opt_peel_loop_initial_if to work on bcsel in an manner similar to how it operates on if-statements (i.e., moving all of the instructions that are only used to support the sources of the bcsel to the end of the loop) is going to be non-trivial.  In the mean time, this fixes the problem in the small test case that I supplied.  As soon as it's done running through the CI, I will submit an MR.
Comment 15 Eero Tamminen 2019-01-11 15:43:37 UTC
(In reply to Ian Romanick from comment #14)
> Created attachment 143048 [details] [review] [review]
> intel/compiler: Try nir_opt_if once before nir_opt_peephole_select

This patch fixes the smaller perf regressions in (nbody & raytracing) compute tests, but it doesn't help at all with the largest regression, in Deferred Multisampling.

(Aztec Ruins I couldn't test due to bug 109304.)
Comment 16 Ian Romanick 2019-01-11 18:41:16 UTC
(In reply to Eero Tamminen from comment #15)
> (In reply to Ian Romanick from comment #14)
> > Created attachment 143048 [details] [review] [review] [review]
> > intel/compiler: Try nir_opt_if once before nir_opt_peephole_select
> 
> This patch fixes the smaller perf regressions in (nbody & raytracing)
> compute tests, but it doesn't help at all with the largest regression, in
> Deferred Multisampling.

It appears that the inner loop still isn't getting unrolled because of the problem this patch should have resolved.  I'll keep digging.
Comment 17 Ian Romanick 2019-01-12 02:16:23 UTC
I put up a WIP: merge request.  This series fixes the issue in the Sascha Willems demos for me.  I get the same number of loops with this series as there was before 8fb8ebfbb05d.

There's a one instruction regression in the deferredmultisampling demo compared to 8fb8ebfbb05d~1, but I can live with that.

https://gitlab.freedesktop.org/mesa/mesa/merge_requests/107

I can't get a successful CI run even with Mesa master right now, so this is definitely not fully tested.
Comment 18 Eero Tamminen 2019-01-14 08:44:17 UTC
(In reply to Ian Romanick from comment #17)
> I put up a WIP: merge request.  This series fixes the issue in the Sascha
> Willems demos for me.  I get the same number of loops with this series as
> there was before 8fb8ebfbb05d.
> 
> There's a one instruction regression in the deferredmultisampling demo
> compared to 8fb8ebfbb05d~1, but I can live with that.
> 
> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/107
> 
> I can't get a successful CI run even with Mesa master right now, so this is
> definitely not fully tested.

Your branch includes bug 109304 so it won't work with Aztec Ruins (which also regressed).

If you can provide patch series I can apply on top of latest Mesa, I could do full test with it.
Comment 19 Eero Tamminen 2019-01-14 11:00:11 UTC
Sorry, I somehow missed the download link (I haven't really looked at gitlab MRs before):
https://gitlab.freedesktop.org/mesa/mesa/merge_requests/107.patch

Performance-wise your changes mostly fix the regressing test-cases.

There are still following gaps:
* Deferred Multisampling: 0.5% (down from 27%)
* Compute N-Body: 3% (down from 9%)

Aztec Ruins and Compute Raytracing gaps seem to have been fixed to within daily variation.

While I think this is fairly acceptable, could you check also Compute N-Body, why it still has a gap?
Comment 20 Ian Romanick 2019-01-14 18:00:58 UTC
(In reply to Eero Tamminen from comment #19)
> Sorry, I somehow missed the download link (I haven't really looked at gitlab
> MRs before):
> https://gitlab.freedesktop.org/mesa/mesa/merge_requests/107.patch
> 
> Performance-wise your changes mostly fix the regressing test-cases.
> 
> There are still following gaps:
> * Deferred Multisampling: 0.5% (down from 27%)
> * Compute N-Body: 3% (down from 9%)

I was not able to reproduce this on my system.  With system Mesa and the branch from the MR I get 34-35 fps.  I looked at the assembly generated before 8fb8ebfbb05d and after this series, and, aside from slightly different register assignment, there's no difference.
Comment 21 Eero Tamminen 2019-01-16 11:26:28 UTC
N-Body test has larger daily variance than the other tests.  Looking at that for last ~3 month, it can be up to 3% i.e. the gap I saw.  As you don't see any gap in this test, I guess I just got a kernel boot where allocation alignments happened so that perf was on the lower side of the variance.

I.e. from my side, everything is good.  I'll recheck the perf once the fix is in Mesa.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.