Bug 107510

Summary: [GEN8+] up to 10% perf drop on several 3D benchmarks
Product: Mesa
Component: Drivers/DRI/i965
Version: git
Hardware: Other
OS: All
Status: VERIFIED FIXED
Severity: normal
Priority: medium
Keywords: bisected, regression
Reporter: Eero Tamminen <eero.t.tamminen>
Assignee: Jason Ekstrand <jason>
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
CC: agomez, mark.a.janes
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=107706
          https://bugs.freedesktop.org/show_bug.cgi?id=107743
Bug Blocks: 109535    

Description Eero Tamminen 2018-08-07 11:42:34 UTC
There's a performance drop of up to 10% on *all* GEN8+ platforms.

On SKL GT2, the regressions are the following:
- 10% GfxBench Tessellation (onscreen & offscreen)
- 5% SynMark2 GSCloth
- 2-3% GpuTest Julia FP64
- 1-2% Unigine Heaven (high quality with tessellation, no AA)
- 1% GfxBench ALU2 (onscreen & offscreen), SynMark PSPom

On other GEN8+ platforms the regressing cases are the same, but percentages differ somewhat.  Performance didn't improve in any test.

Setup:
- Mesa from git
- drm-tip kernel and X server git versions within a couple of months
- Modifier support enabled (shouldn't matter as offscreen versions also regress)

Regressions happen between the following Mesa git versions:
0ea243dcd502e18fde160179faf90639fb012288 2018-07-30 16:31:05 freedreno/a5xx: fix txf_ms
d5175d21c7e8e205a1ea80644b3cc887a586e6e1 2018-08-02 23:30:57 nir: add fall through comment to nir_gather_info

Between these there are a lot of commits that could potentially affect performance, but I don't currently have time to bisect this (maybe next week).  I filed this in hopes that somebody else would have...
Comment 1 Mark Janes 2018-08-07 19:53:33 UTC
Bisected to series ending in:

de9e5cf35a1fd0c66288e49c30dc100ad3e1d3aa
Author:     Jason Ekstrand <jason@jlekstrand.net>
anv/pipeline: Add populate_tcs/tes_key helpers

They don't really do anything interesting, but it's more consistent this
way.

Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>


-------------------------------------
More than one commit in the series drops the tess benchmark score.
Comment 2 Mark Janes 2018-08-07 20:57:45 UTC
more precise bisection:

4434591bf56a6b0c193ef209ea9d7b9e3c95a522
Author:     Jason Ekstrand <jason@jlekstrand.net>
AuthorDate: Tue Jul 31 06:16:34 2018 -0700
Commit:     Jason Ekstrand <jason@jlekstrand.net>
CommitDate: Wed Aug 1 18:02:28 2018 -0700

Parent:     b0bb547f782 intel/nir: Split IO arrays into elements
Containing: master perf
Follows:    18.1-branchpoint (2138)
Precedes:   mesa-18.2.0-rc1 (15)

intel/nir: Call nir_lower_io_to_scalar_early

Shader-db results on Kaby Lake:

    total instructions in shared programs: 15166953 -> 15073611 (-0.62%)
    instructions in affected programs: 2390284 -> 2296942 (-3.91%)
    helped: 16469
    HURT: 505

    total loops in shared programs: 4954 -> 4951 (-0.06%)
    loops in affected programs: 3 -> 0
    helped: 3
    HURT: 0

Reviewed-by: Timothy Arceri <tarceri@itsqueeze.com>
Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
Comment 3 Eero Tamminen 2018-08-08 09:47:21 UTC
(In reply to Mark Janes from comment #2)
> more precise bisection:
...
> intel/nir: Call nir_lower_io_to_scalar_early

Thanks!  Does this also explain the performance drops in the other tests?

The GfxBench Tessellation test is tessellation shader bound, SynMark GSCloth is a transform feedback / streamout / geometry shader test, GpuTest Julia FP64 is a fragment shader ALU test, GfxBench ALU2 is an ALU & texture access scheduling test, and Heaven is bound by overall fragment shader performance.


PS. Any idea whether something is going to be done about regression bug 107223, which seems to cause E2E RBC to get disabled in several tests?
Comment 4 Eero Tamminen 2018-08-15 16:00:00 UTC
Jason has posted a revert for the regressing commit:
  https://patchwork.freedesktop.org/patch/243409/

"Commit 4434591bf56a6b0 caused substantially more URB messages in
geometry and tessellation shaders.  Before we can really enable this
sort of optimization,  We either need some way of combining them back
together into vectors or we need to do cross-stage vector element
elimination without splitting everything into scalars."


This explains the GfxBench Tessellation, SynMark GSCloth and potentially also the Unigine Heaven + tessellation regressions.

However, GfxBench ALU2, GpuTest Julia FP64 and SynMark PSPom don't use those features, so I wonder whether their (small) regressions come from some other commit... Mark?
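
(For context, a minimal sketch, not the actual Mesa code, of what "calling nir_lower_io_to_scalar_early" means in a driver's link step.  nir_lower_io_to_scalar_early(), NIR_PASS_V and nir_var_shader_in/out are real NIR names; the helper function and its exact placement are assumptions.  Splitting vector inputs/outputs into per-component scalars helps cross-stage elimination of unused components, but in tessellation and geometry stages each scalar output then becomes its own URB message instead of one vector write, which is the regression described above.)

    #include "compiler/nir/nir.h"

    /* Hypothetical link-time helper: scalarize I/O between two adjacent
     * shader stages before running cross-stage optimizations. */
    static void
    example_link_io(nir_shader *producer, nir_shader *consumer)
    {
       /* Split vector inputs/outputs into per-component scalar variables. */
       NIR_PASS_V(producer, nir_lower_io_to_scalar_early, nir_var_shader_out);
       NIR_PASS_V(consumer, nir_lower_io_to_scalar_early, nir_var_shader_in);

       /* ...cross-stage optimizations (dead varying component removal,
        * compaction) would run here... */
    }
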
Comment 5 Mark Janes 2018-08-15 16:16:24 UTC
The i965 Mesa performance CI was reduced in scope a while back, and has not included many of these workloads.  I've been working to add them to the standard execution.  It takes time, but I'll be in a good position to evaluate all the workloads in the next day or two.
Comment 6 Jason Ekstrand 2018-08-16 22:28:07 UTC
I reverted the patch:

commit 10f44da775a69561c77438507298363ff4eeb65d
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Wed Aug 8 12:00:55 2018 -0700

    Revert "intel/nir: Call nir_lower_io_to_scalar_early"
    
    Commit 4434591bf56a6b0 caused substantially more URB messages in
    geometry and tessellation shaders.  Before we can really enable this
    sort of optimization,  We either need some way of combining them back
    together into vectors or we need to do cross-stage vector element
    elimination without splitting everything into scalars.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107510
    Fixes: 4434591bf56a6 "intel/nir: Call nir_lower_io_to_scalar_early"
    Acked-by: Kenneth Graunke <kenneth@whitecape.org>
    Tested-by: Mark Janes <mark.a.janes@intel.com>
Comment 7 Eero Tamminen 2018-08-28 09:26:12 UTC
Verified.
Comment 8 Mark Janes 2018-08-29 21:05:31 UTC
This regression appears to be back, somewhere between 6027d354d1d and 45b5f5fa25b.
Comment 9 Timothy Arceri 2018-10-20 01:05:29 UTC
(In reply to Mark Janes from comment #8)
> this regression appears to be back, somewhere between 	6027d354d1d and 
> 45b5f5fa25b

It was re-enabled [1].

I'm currently working on a pass to recombine things back into vectors, which should solve this, as it seems we might have similar issues with the AMD NIR backends.

[1] https://cgit.freedesktop.org/mesa/mesa/commit/?id=8d8222461f9d7f497d657c2c0eff70820986429b
Comment 10 Jason Ekstrand 2018-10-20 03:14:28 UTC
Ugh... I didn't intend to turn io_to_scalar_early back on in that patch.... Yeah, we have some serious issues with making everything scalar for geometry and other stages.  We should probably disable it again unless doing so hurts something badly.
Comment 11 Eero Tamminen 2018-10-22 07:15:15 UTC
(In reply to Timothy Arceri from comment #9)
> (In reply to Mark Janes from comment #8)
> > this regression appears to be back, somewhere between 	6027d354d1d and 
> > 45b5f5fa25b
> 
> It was re-enabled [1].
> 
> I'm currently working on a pass to recombine things back into vectors that
> should solve this as it seem we might have similar issues with the AMD NIR
> backends.
> 
> [1]
> https://cgit.freedesktop.org/mesa/mesa/commit/
> ?id=8d8222461f9d7f497d657c2c0eff70820986429b

I filed new bug 107706 for that two months ago; there's no need to re-open this.
Comment 12 Timothy Arceri 2018-10-24 07:26:52 UTC
*** Bug 107706 has been marked as a duplicate of this bug. ***
Comment 13 Timothy Arceri 2018-10-24 07:48:33 UTC
(In reply to Eero Tamminen from comment #11)
> I filed new bug 107706 for that 2 months ago, there's no need to re-open
> this.

It's the exact same bug, and since this one has the relevant information, I've marked that one as a duplicate.

I've written a new pass [1] which fixes the regression in GfxBench Tessellation; I'm just finishing some testing. Hopefully I'll get some patches on the list tomorrow.

[1] https://gitlab.freedesktop.org/tarceri/mesa/commits/vectorize_io_v2
Comment 14 Eero Tamminen 2018-10-29 14:41:02 UTC
Because exactly the same tests were affected both times this perf regression was introduced to Mesa, I think it's also the cause of the small regressions in the listed fragment shader bound tests:
- GpuTest v0.7 Julia FP64
- GfxBench ALU2
- Unigine Heaven
- SynMark PSPom

Not just of the larger regressions in the two tessellation & geometry shader tests.


Of these, the PSPom regression was fixed between these commits:
18cc65edf8: 2018-10-15 17:56:12 i965: Drop assert about number of uniforms in ARB handling
322a919a41: 2018-10-16 11:47:55 anv: Implement VK_EXT_pci_bus_info

(My guess is that "nir: Copy propagation between blocks" commit fixed it.)


Timothy's latest patch series:
  https://patchwork.freedesktop.org/series/51642/

Fixes just the larger tessellation & geometry shader regressions, not the smaller fragment shader ones (expected, as the latest version disables the optimization pass for the fragment stage).
Comment 15 Timothy Arceri 2018-11-01 00:24:03 UTC
(In reply to Eero Tamminen from comment #14)
> Because exactly same tests were affected both times this perf regression was
> introduced to Mesa, I think it's also the cause for the small regressions in
> the listed fragment shader bound tests:
> - GpuTest v0.7 Julia FP64
> - GfxBench ALU2
> - Unigine Heaven
> - SynMark PSPom
> 
> Not just for the larger regressions in the 2 Tessellation & Geometry shader
> tests.
> 
> 
> Of these, PSPom regression was fixed between these commits:
> 18cc65edf8: 2018-10-15 17:56:12 i965: Drop assert about number of uniforms
> in ARB handling
> 322a919a41: 2018-10-16 11:47:55 anv: Implement VK_EXT_pci_bus_info
> 
> (My guess is that "nir: Copy propagation between blocks" commit fixed it.)
> 
> 
> Timothy's latest patch series:
>   https://patchwork.freedesktop.org/series/51642/
> 
> Fixes just the larger tessellation & geometry shader regressions, not the
> smaller fragment shader ones (expected as latest version disables the
> optimization pass for fragment stage).

Splitting and/or vectorising FS inputs seems to cause the scheduler to vary wildly. I'm seeing up to +/-50% swings in cycle counts in both directions. I'm not sure what to do about this.
Comment 16 Eero Tamminen 2018-11-01 14:00:06 UTC
(In reply to Timothy Arceri from comment #15)
> Splitting and/or vectorising FS inputs seems to case the scheduler to vary
> wildly. I'm seeing up to +/-50% swings in cycle counts in both directions.

I don't trust scheduler cycle counts much for shaders doing sampling.

Texture fetches have the most impact on perf/cycles, but when I last looked at this:
* Cycle counts are obviously wrong for SIMD32 (when comparing reported cycle changes to the actual perf change), so I don't think they properly take the latency compensation from threading into account.
* I'm not sure how well they take the cache impact of texture fetches into account either (on average the first fetch for a given sampler takes more cycles than successive ones, and grouping a few fetches to the same sampler together, and distancing such groups from each other, uses fewer cycles as it minimizes cache thrashing.  Shader assembly produced by the Windows driver looks on average noticeably better in this respect than i965-produced assembly.)


> I'm not sure what to do about this.

IMHO the i965 instruction scheduler's cycle accounting and texture fetch scheduling would need to be improved.  Not sure who could do that, though.

I think that for this bug, your patch is enough though.
Comment 17 Eero Tamminen 2018-12-13 18:12:24 UTC
(In reply to Eero Tamminen from comment #14)
> Timothy's latest patch series:
>   https://patchwork.freedesktop.org/series/51642/
> 
> Fixes just the larger tessellation & geometry shader regressions, not the
> smaller fragment shader ones (expected as latest version disables the
> optimization pass for fragment stage).

Timothy, is there something preventing this from being merged?  I didn't see any regressions from your patch, just improvements in the regressed cases.
Comment 18 Eero Tamminen 2019-01-30 13:25:52 UTC
*** Bug 107743 has been marked as a duplicate of this bug. ***
Comment 19 Eero Tamminen 2019-02-12 09:00:12 UTC
Up to ~20% perf drops in Sascha Willems' Vulkan Terrain / tessellation tests, listed here:
https://share.lunarg.com/vulkan/report/2957#performance

The drops are in the loop & replay variants (not load time), have been present since Mesa 18.1.1, and seem likely to be caused by this too.
Comment 20 Kenneth Graunke 2019-02-14 20:29:56 UTC
There are too many things being tracked in this bug for me to know whether it's fixed or not.

Vulkan regressions are likely fixed by 39aee57523a02552e7eae7df5da488e535aeb1eb (anv: Put MOCS in the correct location).  But, that wouldn't affect any GL programs, so any GL regressions may still exist.
Comment 21 Mark Janes 2019-02-14 22:11:40 UTC
I think Jason meant this to block the 19.0 release, instead of depending on it.
Comment 22 Eero Tamminen 2019-02-18 09:31:39 UTC
(In reply to Kenneth Graunke from comment #20)
> There are too many things being tracked in this bug for me to know whether
> it's fixed or not.

This bug tracks just the regression coming from this commit:
-------------------------------------------------
commit 8d8222461f9d7f497d657c2c0eff70820986429b
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Mon Jul 23 22:20:41 2018 -0700
    intel/nir: Enable nir_opt_find_array_copies
-------------------------------------------------

This re-introduces the same issue that got fixed by Jason's revert of the "intel/nir: Call nir_lower_io_to_scalar_early" commit.

Timothy's patch series fixes the issue (but is explicitly limited to tess/geom shaders), with the following explanation:
   Commit 8d8222461f9d7f49 caused substantially more URB messages in
   geometry and tessellation shaders (due to enabling
   nir_lower_io_to_scalar_early). This combines io again to avoid
   this regression while still allowing link time optimisation of
   components.


> Vulkan regressions are likely fixed by
> 39aee57523a02552e7eae7df5da488e535aeb1eb (anv: Put MOCS in the correct
> location).  But, that wouldn't affect any GL programs, so any GL regressions
> may still exist.

If your commit didn't fix them, IMHO this bug is the most likely explanation for the tessellation regression.
Comment 23 Jason Ekstrand 2019-03-08 16:52:13 UTC
I've got an MR to fix this:

https://gitlab.freedesktop.org/mesa/mesa/merge_requests/409
Comment 24 Jason Ekstrand 2019-03-12 15:36:26 UTC
This is fixed by the following commit now in master:

commit 6d5d89d25a0a4299dbfcbfeca71b6c7e65ef3d45 (HEAD -> master, origin/master, origin/HEAD)
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Thu Mar 7 15:01:37 2019 -0600

    intel/nir: Vectorize all IO
    
    The IO scalarization pass that we run to help with linking end up
    turning some shader I/O such as that for tessellation and geometry
    shaders into many scalar URB operations rather than one vector one.  To
    alleviate this, we now vectorize the I/O once again.  This fixes a 10%
    performance regression in the GfxBench tessellation test that was caused
    by scalarizing.
    
    Shader-db results on Kaby Lake:
    
        total instructions in shared programs: 15224023 -> 15220871 (-0.02%)
        instructions in affected programs: 342009 -> 338857 (-0.92%)
        helped: 1236
        HURT: 443
    
        total spills in shared programs: 23471 -> 23465 (-0.03%)
        spills in affected programs: 6 -> 0
        helped: 1
        HURT: 0
    
        total fills in shared programs: 31770 -> 31766 (-0.01%)
        fills in affected programs: 4 -> 0
        helped: 1
        HURT: 0
    
    Cycles was just a lot of churn do to moves being different places.  Most
    of the pure churn in instructions was +/- one or two instructions in
    fragment shaders.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107510
    Fixes: 4434591bf56a "intel/nir: Call nir_lower_io_to_scalar_early"
    Fixes: 8d8222461f9d "intel/nir: Enable nir_opt_find_array_copies"
    Reviewed-by: Connor Abbott <cwabbott0@gmail.com>
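
(A minimal sketch of what the fix does conceptually: after linking and cross-stage optimization, the scalarized I/O is recombined so that tessellation/geometry outputs go back to single vector URB writes.  nir_lower_io_to_vector is the NIR pass for this kind of recombination, but whether the fix uses exactly this pass and call order is an assumption, and the helper below is hypothetical.)

    /* Hypothetical follow-up to the linking sketch after comment 4:
     * re-vectorize the per-component I/O once unused components have
     * been removed, so each output is written with one URB message. */
    static void
    example_revectorize_io(nir_shader *producer, nir_shader *consumer)
    {
       NIR_PASS_V(producer, nir_lower_io_to_vector, nir_var_shader_out);
       NIR_PASS_V(consumer, nir_lower_io_to_vector, nir_var_shader_in);
    }
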
Comment 25 Eero Tamminen 2019-03-13 11:41:32 UTC
(In reply to Eero Tamminen from comment #0)
> On SKL GT2, the regressions are following:
> - 10% GfxBench Tessellation (onscreen & offscreen)
> - 5% SynMark2 GSCloth
> - 2-3% GpuTest Julia FP64
> - 1-2% Unigine Heaven (high quality with tessellation, no AA)
> - 1% GfxBench ALU2 (onscreen & offscreen), SynMark PSPom

(In reply to Jason Ekstrand from comment #24)
> This is fixed by the following commit now in master:
...
>     intel/nir: Vectorize all IO

All the other tests improved by the amount they had regressed, except for these two, which were not affected by the fix:
- GpuTest Julia FP64
- SynMark PSPom

I assume the PSPom drop was already fixed in October, with the "nir: Copy propagation between blocks" commit (but it regressed last month; I'll file a separate bug for that).  The FP64 case isn't really relevant.

VERIFIED.
Comment 26 Eero Tamminen 2019-03-27 12:20:30 UTC
The fix for this (or a commit near it) also improved FurMark by ~1%.
