Bug 98409

Summary: 4.5% perf drop in CSDof with "nir: Optimize integer division and modulus with 1"
Product: Mesa Reporter: Eero Tamminen <eero.t.tamminen>
Component: Drivers/DRI/i965Assignee: Ian Romanick <idr>
Status: CLOSED FIXED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: medium    
Version: git   
Hardware: Other   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=98299
Whiteboard:
i915 platform: i915 features:

Description Eero Tamminen 2016-10-24 11:24:25 UTC
Following commit drops SynMark2 CSDof test performance on all platforms supporting compute shaders:

commit 4d35683d91e3d61bf14b76d801bf6ae17237e162
Author: Ian Romanick <ian.d.romanick@intel.com>
Date:   Wed Oct 19 08:53:10 2016 -0700

    nir: Optimize integer division and modulus with 1
    
    The previous power-of-two rules didn't catch idiv (because i965 doesn't
    set lower_idiv) and imod cases.  The udiv and umod cases should have
    been caught, but I included them for orthogonality.
    
    This fixes silly code observed from compute shaders with local_size_[xy]
    = 1.

Commit seems clear optimization, so I assume this regression is some kind of bad interaction in the optimization passes.

On SKL GT2 drop is 4.5% and more on on GT4(e).  INTEL_DEBUG=perf reports a lot of register spilling and tells about inefficient fallback code for CS variable indexing with this test.   No other tests besides CSDof were affected, but it's the only test in our set that is register spilling currently.
Comment 1 Mark Janes 2016-10-24 16:46:28 UTC
I see the same perf regression in my results.
Comment 2 Matt Turner 2016-10-24 18:38:00 UTC
Strange. I don't see shader-db differences, and I just recaptured the CSDof shaders and don't see differences there either.

None of the shaders seem to use intdiv and only one uses intmod (by 0xc0).
Comment 3 Matt Turner 2016-10-24 22:39:32 UTC
Jordan noted that I likely needed to revert

i965/cs: Use udiv/umod for local IDs
i965/cs: Don't use a thread channel ID for small local sizes

as well in order to see a difference. Indeed, I do, though it looks like the three patches helped instruction and cycle counts in four compute shaders by small amounts. Notably, no differences in spills or fills.
Comment 4 Eero Tamminen 2016-10-25 11:30:16 UTC
Confirming Matt's observation.

Reverting the indicated change from Mesa HEAD, doesn't affect the performance, although going back to version before this commit (or commits before it), shows 4-5% better perf than the version built from the commit.  I.e. additional changes after the indicated commit also have effect on CSDof test.
Comment 5 Jordan Justen 2016-10-27 20:45:47 UTC
(In reply to Eero Tamminen from comment #4)
> Confirming Matt's observation.
> 
> Reverting the indicated change from Mesa HEAD, doesn't affect the
> performance, although going back to version before this commit (or commits
> before it), shows 4-5% better perf than the version built from the commit. 
> I.e. additional changes after the indicated commit also have effect on CSDof
> test.

When reverting:

4d35683 nir: Optimize integer division and modulus with 1

Did you also revert these?

64c3d73 i965/cs: Don't use a thread channel ID for small local sizes
1fa000a i965/cs: Use udiv/umod for local IDs

It would be good to know if taking the current master, and
reverting these 3 patches regains the 4-5%.
Comment 6 Eero Tamminen 2016-10-28 11:09:27 UTC
(In reply to Jordan Justen from comment #5)
> When reverting:
> 
> 4d35683 nir: Optimize integer division and modulus with 1
> 
> Did you also revert these?

No.


> 64c3d73 i965/cs: Don't use a thread channel ID for small local sizes
> 1fa000a i965/cs: Use udiv/umod for local IDs
> 
> It would be good to know if taking the current master, and
> reverting these 3 patches regains the 4-5%.

Tried that now.

If one builds commit 4d35683, reverting just that commit fixes the drop.  With HEAD, that's not enough, but reverting all 3 patches does fix the drop there too.
Comment 7 Eero Tamminen 2016-10-28 14:53:46 UTC
Note: performance-wise I'm less worried in this test about perf drop than these INTEL_DEBUG=perf warnings (and resulting spilling):
-----------------
Unsupported form of variable indexing in CS; falling back to very inefficient code generation
-----------------
Comment 8 Kenneth Graunke 2017-02-03 09:25:54 UTC
Probably not worth tracking this old regression - we've since improved performance a lot.  Closing.
Comment 9 Eero Tamminen 2017-02-03 09:30:30 UTC
verified/closed

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.