Summary: | [ivb] Mesa 9.1 performance regression on KWin's Lanczos shader | ||
---|---|---|---|
Product: | Mesa | Reporter: | Adam Lyall <magicmyth> |
Component: | Drivers/DRI/i965 | Assignee: | Eric Anholt <eric> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | normal | ||
Priority: | high | CC: | jlp.bugs, mgraesslin, mike, nicepics13, sarvatt, stavallo, tjaalton |
Version: | 9.1 | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
See Also: | https://bugs.kde.org/show_bug.cgi?id=313613 | ||
Whiteboard: | |||
i915 platform: | i915 features: | ||
Attachments: |
glxinfo-l
Output of glxinfo -l with mesa 9.2.0 (git-edf2150) |
Description
Adam Lyall
2013-02-27 13:45:28 UTC
Investigation by one of our users showed that the following commit introduced the regression: http://cgit.freedesktop.org/mesa/mesa/commit/?h=9.1&id=a64c1eb9b110f29b8abf803a8256306702629bdc i965/fs: Add support for uniform array access with a variable index. Serious Sam 3 had a shader hitting this path, but it's used rarely so it didn't show a significant performance difference (n=7). It does reduce compile time massively, though -- one shader goes from 14s compile time and 11723 instructions generated to .44s and 499 instructions. Note that some shaders lose 16-wide mode because we don't support 16-wide and pull constants at the moment (generally, things looping over a few-element array where the loop isn't getting unrolled). Given that those shaders are being generated with 15-20% fewer instructions, it probably outweighs the loss of 16-wide. === The regression seems to only happen for IvyBridge, I am not able to reproduce with SandyBridge. The shader in KWin causing problems can be found here: http://quickgit.kde.org/?p=kde-workspace.git&a=blob&h=3141d7907ce27415254409a19a16e18d8fe99721&hb=316c0c55b84d4e47b593229df143fd6397d23108&f=kwin%2Flanczos-fragment.glsl Or just: uniform sampler2D texUnit; uniform vec2 offsets[16]; uniform vec4 kernel[16]; varying vec2 varyingTexCoords; void main(void) { vec4 sum = texture2D(texUnit, varyingTexCoords.st) * kernel[0]; for (int i = 1; i < 16; i++) { sum += texture2D(texUnit, varyingTexCoords.st - offsets[i]) * kernel[i]; sum += texture2D(texUnit, varyingTexCoords.st + offsets[i]) * kernel[i]; } gl_FragColor = sum; } If there is anything we can do in the shader to improve, please let us know. For the time being we have disabled the shader on the affected driver/hardware combination: http://commits.kde.org/kde-workspace/78a9ca708a5764169c456b9cff2268327c09041f (In reply to comment #1) > Investigation by one of our users showed that the following commit > introduced the regression: > http://cgit.freedesktop.org/mesa/mesa/commit/?h=9. > 1&id=a64c1eb9b110f29b8abf803a8256306702629bdc > > > i965/fs: Add support for uniform array access with a variable index. > > Serious Sam 3 had a shader hitting this path, but it's used rarely so it > didn't show a significant performance difference (n=7). It does reduce > compile time massively, though -- one shader goes from 14s compile time and > 11723 instructions generated to .44s and 499 instructions. > > Note that some shaders lose 16-wide mode because we don't support 16-wide > and pull constants at the moment (generally, things looping over a > few-element array where the loop isn't getting unrolled). Given that those > shaders are being generated with 15-20% fewer instructions, it probably > outweighs the loss of 16-wide. > > === > The regression seems to only happen for IvyBridge, I am not able to > reproduce with SandyBridge. The shader in KWin causing problems can be found > here: > http://quickgit.kde.org/?p=kde-workspace. > git&a=blob&h=3141d7907ce27415254409a19a16e18d8fe99721&hb=316c0c55b84d4e47b593 > 229df143fd6397d23108&f=kwin%2Flanczos-fragment.glsl > > Or just: > > uniform sampler2D texUnit; > uniform vec2 offsets[16]; > uniform vec4 kernel[16]; > > varying vec2 varyingTexCoords; > > void main(void) > { > vec4 sum = texture2D(texUnit, varyingTexCoords.st) * kernel[0]; > for (int i = 1; i < 16; i++) { > sum += texture2D(texUnit, varyingTexCoords.st - offsets[i]) * > kernel[i]; > sum += texture2D(texUnit, varyingTexCoords.st + offsets[i]) * > kernel[i]; > } If the offsets are just biases to get at neighbouring pixels, it would be a lot faster on both Sandy Bridge and Ivy Bridge if you use textureOffset. Something like: sum = texture(texUnit, varyingTexCoord.st) * kernel[0]; sum += textureOffset(texUnit, varyingTexCoord.st, ivec2(-1, 0)) * kernel[1]; sum += textureOffset(texUnit, varyingTexCoord.st, ivec2(-2, 0)) * kernel[2]; ... sum += textureOffset(texUnit, varyingTexCoord.st, ivec2(4, 4)) * kernel[15]; gl_FragColor = sum; That would cut out all the uniforms (and the need to update them!) and, literally, half the instructions from the shader. That not withstanding, it sounds like we have some bug here that we should fix. > gl_FragColor = sum; > } > > > If there is anything we can do in the shader to improve, please let us know. > For the time being we have disabled the shader on the affected > driver/hardware combination: > http://commits.kde.org/kde-workspace/78a9ca708a5764169c456b9cff2268327c09041f I'm betting it's related to "you really really need to use the sampler instead of the data cache" -- you happen to have a small enough set of uniforms that you had push constants before, so it's a likely explanation for things getting worse. I'm looking into reviving the sampler change, but in the meantime, I'd also second the recommendation to use textureOffset(), which should be a big win. I have already started working on an implementation with textureOffset and it feels faster on SandyBridge. Unfortunately it will have to wait for the next feature release (4.11) as it needs some reworking to make use of later shading languages (so far we only use 1.0) 24% improvement on a uniform-index uniform access case that's not nearly as uniform-dependent as what you've built here. Yeah, I'd say we're on track for what this regression is. (unfortunately, variable-indexed array access is a bit trickier to fix) The fs-varying-uniform branch of my tree has possible code to fix the problem. I'd need to pull down kwin again, and this conference wireless is seriously weak. http://cgit.freedesktop.org/~anholt/mesa My other question is why we aren't unrolling this loop. It's a good candidate for doing so, and I bet we have in the past. That would also get us the sampler pull constant load behavior we want. I am the user who bisected mesa down to the reported commit. Do you want me to test your mesa branch? Any particular information you want me to report back? Didn't wait :-) Recompiled kwin (by commenting the two lines of code introduced in 4.10.1 to disable the Lanczos filter) and compiled mesa from Eric's fs-varying-uniform branch. The problem seems to be fixed, as window switching is fast again. I attach the output of glxinfo -l. To be noted are the following lines: brwCreateContext: failed to init intel context brwCreateContext: failed to init intel context brwCreateContext: failed to init intel context brwCreateContext: failed to init intel context brwCreateContext: failed to init intel context brwCreateContext: failed to init intel context which never appeared with mesa 9.1. Of course, if you need more information, just ask. Created attachment 76519 [details]
Output of glxinfo -l with mesa 9.2.0 (git-edf2150)
OK, Ken took a look, and those messages are coming from our driver under the new glxinfo, which is making a series of contexts of different versions (some of which we don't support, like GL 4.3). We just need to make that old message shut up :) I'm still mulling over whether this is the right fix or not, but I'm glad to hear it helps. These commits cherry-picked into 9.1 branch do indeed fix the slow blur problem in the unity dash as well i965/fs: Improve performance of varying-index uniform loads on IVB. i965/fs: Move varying uniform offset compuation into the helper func. i965/fs: Remove creation of a MOV instruction that's never used. i965: Remove the old brw_optimize() code. i965/vs: Add a pass to set dependency control fields on instructions. ..but they only fix it on IVB, SNB as well as my gen4 are equally regressed. New version of the branch is up that I'm happier with, but I want to poke at pre-ivb a bit before I settle on it. Timo: New tree at fs-varying-uniform-gen4, care to give it a shot? Thanks, looks like SNB is fixed with the new branch. And now I should probably admit that the test-case I had (unity dash) was already quite slow with 965GM on 9.0, so it didn't regress with 9.1. And this branch doesn't improve it either. I'll test 8.0.x later to see if there's a difference but I'd not worry about it too much for now. Fixed in the series ending with: commit 62501c3af85089b423218a41a2e2433ac849c2d3 Author: Eric Anholt <eric@anholt.net> Date: Tue Mar 19 17:45:02 2013 -0700 i965/fs: Allow CSE on pre-gen7 varying-index uniform loads the release folks will pick it over to 9.1 before the next release, unless something goes wrong. Applied the patch and it did _not_ fix this issue for me (i7 2600K and i3-2350M). Still bad performance on "present windows" etc. I just saw that there are more then just this single commit you mentioned that might be related. Should I just pull all those patches starting from http://cgit.freedesktop.org/mesa/mesa/commit/?id=1d6ead38042cc0d1e667d8ff55937c1e32d108b1 ? Yes, you need the whole patch series. I've heard that gen5 regressed as well, and not fixed by this branch. Should this be reopened or a new bug filed? definitely new bug. (In reply to comment #17) > Fixed in the series ending with: > > commit 62501c3af85089b423218a41a2e2433ac849c2d3 > Author: Eric Anholt <eric@anholt.net> > Date: Tue Mar 19 17:45:02 2013 -0700 > > i965/fs: Allow CSE on pre-gen7 varying-index uniform loads > > the release folks will pick it over to 9.1 before the next release, unless > something goes wrong. Mesa 9.1.2 has been released and it does not seem to include this patch series (pls correct me if I am wrong). Did the release folks forget to pick it or something went wrong? Thanks. It was a big invasive patch series and isn't cherry-picked back (I don't think we will) |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.