Bug 61554

Summary:	[ivb] Mesa 9.1 performance regression on KWin's Lanczos shader
Product:	Mesa	Reporter:	Adam Lyall <magicmyth>
Component:	Drivers/DRI/i965	Assignee:	Eric Anholt <eric>
Status:	RESOLVED FIXED	QA Contact:
Severity:	normal
Priority:	high	CC:	jlp.bugs, mgraesslin, mike, nicepics13, sarvatt, stavallo, tjaalton
Version:	9.1
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
See Also:	https://bugs.kde.org/show_bug.cgi?id=313613
Whiteboard:
i915 platform:		i915 features:
Attachments:	glxinfo-l Output of glxinfo -l with mesa 9.2.0 (git-edf2150)

Description Adam Lyall 2013-02-27 13:45:28 UTC

Created attachment 75637 [details]
glxinfo-l

Using an Intel HD4000 and Mesa 9.1 and later git will cause all the scaling related effects in Kwin to become extremely slow.

Changing the scaling methods used by Kwin in advance desktop effect settings removes the problem.

Also disabling the OpenGL 2 shaders fixes the problem.

Rolling back to Mesa 9 fixes the problem.

Please see my Kwin bug report:
https://bugs.kde.org/show_bug.cgi?id=313613 for further details.

Also some forums posts documenting the same problem:
https://bbs.archlinux.org/viewtopic.php?pid=1237363

Comment 1 Martin Flöser 2013-03-01 07:09:44 UTC

Investigation by one of our users showed that the following commit introduced the regression: http://cgit.freedesktop.org/mesa/mesa/commit/?h=9.1&id=a64c1eb9b110f29b8abf803a8256306702629bdc


i965/fs: Add support for uniform array access with a variable index.

Serious Sam 3 had a shader hitting this path, but it's used rarely so it didn't show a significant performance difference (n=7). It does reduce compile time massively, though -- one shader goes from 14s compile time and 11723 instructions generated to .44s and 499 instructions.

Note that some shaders lose 16-wide mode because we don't support 16-wide and pull constants at the moment (generally, things looping over a few-element array where the loop isn't getting unrolled). Given that those shaders are being generated with 15-20% fewer instructions, it probably outweighs the loss of 16-wide.

===
The regression seems to only happen for IvyBridge, I am not able to reproduce with SandyBridge. The shader in KWin causing problems can be found here: http://quickgit.kde.org/?p=kde-workspace.git&a=blob&h=3141d7907ce27415254409a19a16e18d8fe99721&hb=316c0c55b84d4e47b593229df143fd6397d23108&f=kwin%2Flanczos-fragment.glsl

Or just:

uniform sampler2D texUnit;
uniform vec2 offsets[16];
uniform vec4 kernel[16];

varying vec2 varyingTexCoords;

void main(void)
{
    vec4 sum = texture2D(texUnit, varyingTexCoords.st) * kernel[0];
    for (int i = 1; i < 16; i++) {
        sum += texture2D(texUnit, varyingTexCoords.st - offsets[i]) * kernel[i];
        sum += texture2D(texUnit, varyingTexCoords.st + offsets[i]) * kernel[i];
    }
    gl_FragColor = sum;
}


If there is anything we can do in the shader to improve, please let us know. For the time being we have disabled the shader on the affected driver/hardware combination: http://commits.kde.org/kde-workspace/78a9ca708a5764169c456b9cff2268327c09041f

Comment 2 Ian Romanick 2013-03-01 07:44:19 UTC

(In reply to comment #1)
> Investigation by one of our users showed that the following commit
> introduced the regression:
> http://cgit.freedesktop.org/mesa/mesa/commit/?h=9.
> 1&id=a64c1eb9b110f29b8abf803a8256306702629bdc
> 
> 
> i965/fs: Add support for uniform array access with a variable index.
> 
> Serious Sam 3 had a shader hitting this path, but it's used rarely so it
> didn't show a significant performance difference (n=7). It does reduce
> compile time massively, though -- one shader goes from 14s compile time and
> 11723 instructions generated to .44s and 499 instructions.
> 
> Note that some shaders lose 16-wide mode because we don't support 16-wide
> and pull constants at the moment (generally, things looping over a
> few-element array where the loop isn't getting unrolled). Given that those
> shaders are being generated with 15-20% fewer instructions, it probably
> outweighs the loss of 16-wide.
> 
> ===
> The regression seems to only happen for IvyBridge, I am not able to
> reproduce with SandyBridge. The shader in KWin causing problems can be found
> here:
> http://quickgit.kde.org/?p=kde-workspace.
> git&a=blob&h=3141d7907ce27415254409a19a16e18d8fe99721&hb=316c0c55b84d4e47b593
> 229df143fd6397d23108&f=kwin%2Flanczos-fragment.glsl
> 
> Or just:
> 
> uniform sampler2D texUnit;
> uniform vec2 offsets[16];
> uniform vec4 kernel[16];
> 
> varying vec2 varyingTexCoords;
> 
> void main(void)
> {
>     vec4 sum = texture2D(texUnit, varyingTexCoords.st) * kernel[0];
>     for (int i = 1; i < 16; i++) {
>         sum += texture2D(texUnit, varyingTexCoords.st - offsets[i]) *
> kernel[i];
>         sum += texture2D(texUnit, varyingTexCoords.st + offsets[i]) *
> kernel[i];
>     }

If the offsets are just biases to get at neighbouring pixels, it would be a lot faster on both Sandy Bridge and Ivy Bridge if you use textureOffset.  Something like:

    sum = texture(texUnit, varyingTexCoord.st) * kernel[0];
    sum += textureOffset(texUnit, varyingTexCoord.st, ivec2(-1, 0)) * kernel[1];
    sum += textureOffset(texUnit, varyingTexCoord.st, ivec2(-2, 0)) * kernel[2];
    ...
    sum += textureOffset(texUnit, varyingTexCoord.st, ivec2(4, 4)) * kernel[15];
    gl_FragColor = sum;

That would cut out all the uniforms (and the need to update them!) and, literally, half the instructions from the shader.

That not withstanding, it sounds like we have some bug here that we should fix.

>     gl_FragColor = sum;
> }
> 
> 
> If there is anything we can do in the shader to improve, please let us know.
> For the time being we have disabled the shader on the affected
> driver/hardware combination:
> http://commits.kde.org/kde-workspace/78a9ca708a5764169c456b9cff2268327c09041f

Comment 3 Eric Anholt 2013-03-06 00:59:03 UTC

I'm betting it's related to "you really really need to use the sampler instead of the data cache" -- you happen to have a small enough set of uniforms that you had push constants before, so it's a likely explanation for things getting worse.  I'm looking into reviving the sampler change, but in the meantime, I'd also second the recommendation to use textureOffset(), which should be a big win.

Comment 4 Martin Flöser 2013-03-06 07:11:40 UTC

I have already started working on an implementation with textureOffset and it feels faster on SandyBridge. Unfortunately it will have to wait for the next feature release (4.11) as it needs some reworking to make use of later shading languages (so far we only use 1.0)

Comment 5 Eric Anholt 2013-03-07 03:06:29 UTC

24% improvement on a uniform-index uniform access case that's not nearly as uniform-dependent as what you've built here.  Yeah, I'd say we're on track for what this regression is.

(unfortunately, variable-indexed array access is a bit trickier to fix)

Comment 6 Eric Anholt 2013-03-13 22:59:39 UTC

The fs-varying-uniform branch of my tree has possible code to fix the problem.  I'd need to pull down kwin again, and this conference wireless is seriously weak.

http://cgit.freedesktop.org/~anholt/mesa

Comment 7 Eric Anholt 2013-03-14 08:00:48 UTC

My other question is why we aren't unrolling this loop.  It's a good candidate for doing so, and I bet we have in the past.  That would also get us the sampler pull constant load behavior we want.

Comment 8 Stefano Avallone 2013-03-14 08:40:49 UTC

I am the user who bisected mesa down to the reported commit. Do you want me to test your mesa branch? Any particular information you want me to report back?

Comment 9 Stefano Avallone 2013-03-14 10:13:20 UTC

Didn't wait :-)

Recompiled kwin (by commenting the two lines of code introduced in 4.10.1 to disable the Lanczos filter) and compiled mesa from Eric's fs-varying-uniform branch.

The problem seems to be fixed, as window switching is fast again.

I attach the output of glxinfo -l. To be noted are the following lines:

brwCreateContext: failed to init intel context
brwCreateContext: failed to init intel context
brwCreateContext: failed to init intel context
brwCreateContext: failed to init intel context
brwCreateContext: failed to init intel context
brwCreateContext: failed to init intel context

which never appeared with mesa 9.1.

Of course, if you need more information, just ask.

Comment 10 Stefano Avallone 2013-03-14 10:14:14 UTC

Created attachment 76519 [details]
Output of glxinfo -l with mesa 9.2.0 (git-edf2150)

Comment 11 Eric Anholt 2013-03-14 19:23:36 UTC

OK, Ken took a look, and those messages are coming from our driver under the new glxinfo, which is making a series of contexts of different versions (some of which we don't support, like GL 4.3).  We just need to make that old message shut up :)

I'm still mulling over whether this is the right fix or not, but I'm glad to hear it helps.

Comment 12 roberth 2013-03-14 22:15:00 UTC

These commits cherry-picked into 9.1 branch do indeed fix the slow blur problem in the unity dash as well

i965/fs: Improve performance of varying-index uniform loads on IVB.
i965/fs: Move varying uniform offset compuation into the helper func.
i965/fs: Remove creation of a MOV instruction that's never used.
i965: Remove the old brw_optimize() code.
i965/vs: Add a pass to set dependency control fields on instructions.

Comment 13 Timo Aaltonen 2013-03-15 13:33:14 UTC

..but they only fix it on IVB, SNB as well as my gen4 are equally regressed.

Comment 14 Eric Anholt 2013-03-15 23:19:02 UTC

New version of the branch is up that I'm happier with, but I want to poke at pre-ivb a bit before I settle on it.

Comment 15 Eric Anholt 2013-03-20 23:48:45 UTC

Timo: New tree at fs-varying-uniform-gen4, care to give it a shot?

Comment 16 Timo Aaltonen 2013-03-22 10:55:35 UTC

Thanks, looks like SNB is fixed with the new branch. And now I should probably admit that the test-case I had (unity dash) was already quite slow with 965GM on 9.0, so it didn't regress with 9.1. And this branch doesn't improve it either. I'll test 8.0.x later to see if there's a difference but I'd not worry about it too much for now.

Comment 17 Eric Anholt 2013-04-01 23:49:36 UTC

Fixed in the series ending with:

commit 62501c3af85089b423218a41a2e2433ac849c2d3
Author: Eric Anholt <eric@anholt.net>
Date:   Tue Mar 19 17:45:02 2013 -0700

    i965/fs: Allow CSE on pre-gen7 varying-index uniform loads

the release folks will pick it over to 9.1 before the next release, unless something goes wrong.

Comment 18 Franz Fellner 2013-04-02 06:59:38 UTC

Applied the patch and it did _not_ fix this issue for me (i7 2600K and i3-2350M). Still bad performance on "present windows" etc.

Comment 19 Franz Fellner 2013-04-02 07:24:18 UTC

I just saw that there are more then just this single commit you mentioned that might be related. Should I just pull all those patches starting from http://cgit.freedesktop.org/mesa/mesa/commit/?id=1d6ead38042cc0d1e667d8ff55937c1e32d108b1 ?

Comment 20 Eric Anholt 2013-04-03 00:53:21 UTC

Yes, you need the whole patch series.

Comment 21 Timo Aaltonen 2013-04-08 14:08:23 UTC

I've heard that gen5 regressed as well, and not fixed by this branch. Should this be reopened or a new bug filed?

Comment 22 Eric Anholt 2013-04-08 15:36:21 UTC

definitely new bug.

Comment 23 Stefano Avallone 2013-05-02 08:30:43 UTC

(In reply to comment #17)
> Fixed in the series ending with:
> 
> commit 62501c3af85089b423218a41a2e2433ac849c2d3
> Author: Eric Anholt <eric@anholt.net>
> Date:   Tue Mar 19 17:45:02 2013 -0700
> 
>     i965/fs: Allow CSE on pre-gen7 varying-index uniform loads
> 
> the release folks will pick it over to 9.1 before the next release, unless
> something goes wrong.

Mesa 9.1.2 has been released and it does not seem to include this patch series (pls correct me if I am wrong). Did the release folks forget to pick it or something went wrong? 

Thanks.

Comment 24 Eric Anholt 2013-05-02 15:57:53 UTC

It was a big invasive patch series and isn't cherry-picked back (I don't think we will)

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.