The following commit causes a large performance regression with DRI3 in synthetic benchmarks on both Sandybridge and Broxton.
Author: Michel Dänzer <firstname.lastname@example.org>
AuthorDate: Wed Aug 17 17:02:04 2016 +0900
Commit: Michel Dänzer <email@example.com>
CommitDate: Thu Aug 25 17:40:24 2016 +0900
loader/dri3: Overhaul dri3_update_num_back
Always use 3 buffers when flipping. With only 2 buffers, we have to wait
for a flip to complete (which takes non-0 time even with asynchronous
flips) before we can start working on the next frame. We were previously
only using 2 buffers for flipping if the X server supports asynchronous
flips, even when we're not using asynchronous flips. This could result
in bad performance (the referenced bug report is an extreme case, where
the inter-frame stalls were preventing the GPU from reaching its maximum
I couldn't measure any performance boost using 4 buffers with flipping.
Performance actually seemed to go down slightly, but that might have
been just noise.
Without flipping, a single back buffer is enough for swap interval 0,
but we need to use 2 back buffers when the swap interval is non-0,
otherwise we have to wait for the swap interval to pass before we can
start working on the next frame. This condition was previously reversed.
Cc: "12.0 11.2" <firstname.lastname@example.org>
Reviewed-by: Frank Binns <email@example.com>
Reviewed-by: Eric Anholt <firstname.lastname@example.org>
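The buffer-count policy described in the commit message above can be sketched as follows. This is a minimal illustration and not the actual Mesa code: the function name `sketch_num_back` and its parameters are hypothetical, and the real `dri3_update_num_back` operates on the loader's drawable state rather than on plain arguments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, NOT the actual Mesa function: returns the number
 * of back buffers that the commit message says should be used. */
static int
sketch_num_back(bool flipping, int swap_interval)
{
   int num_back;

   if (flipping)
      num_back = 3;   /* flipping: avoid waiting for a pending flip */
   else if (swap_interval != 0)
      num_back = 2;   /* no flip, non-0 interval: avoid waiting for it to pass */
   else
      num_back = 1;   /* no flip, interval 0: one back buffer is enough */

   assert(num_back >= 1 && num_back <= 3);
   return num_back;
}
```

Under these assumptions, a non-flipping drawable with swap interval 0 ends up with a single back buffer, which is the case the rest of this report is concerned with.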
Reverting the patch restores the earlier performance (the bisect was done on Broxton and the revert was tested on Sandybridge, so the same commit is the problem on both).
Impact is larger for tests with higher FPS, and naturally affects only onscreen versions of the tests. Both fullscreen and windowed+composited tests were affected.
On Sandybridge impact is up to 35% (SynMark Batch tests), 25% in GpuTest Triangle test, and less in other tests.
On Broxton the drop affects more tests (due to the better GPU, heavier tests have higher FPS), even a few tests that are normally fully ALU bound:
* SynMark v6: up to 40% (Batch tests)
* GfxBench v4: 35% ALU, 25% Driver, 10% Tess tests
* Lightsmark 2008: 20%
* GpuTest 0.7: 15% Triangle, Julia32 & Plot3D tests
* GLB 2.7: 10% Egypt
The change doesn't seem to affect HSW, BDW or SKL. I don't know why.
The issue doesn't seem to be related to FPS (occasionally) being limited to some multiple of 60 FPS, as in the earlier DRI3 perf bug. My assumption is that the buffering change indirectly affects some memory setting, but I don't know which, as SNB & BXT differ in that respect:
* SNB has LLC, but BXT doesn't
* AFAIK Intel DDX supports SNA for SNB, but not yet for BXT
Please attach the corresponding Xorg log file.
It would also be interesting to know which path dri3_update_num_back takes before and after the change for each affected test.
Created attachment 126171 [details]
(In reply to Michel Dänzer from comment #1)
> Please attach the corresponding Xorg log file.
Attached. It didn't seem to have anything significant (the only differences from the version before the issue are timestamps, build IDs and some additional input device lines).
Does the problem also occur with the modesetting driver instead of intel?
If yes, I'm afraid we can't make progress on this bug without seeing at least the number of buffers used before and after my change in the affected cases, preferably also the values of the variables used to decide that number.
Created attachment 126214 [details] [review]
Always use at least two buffers
Does this patch fix the problem?
(In reply to Michel Dänzer from comment #4)
> Created attachment 126214 [details] [review]
> Always use at least two buffers
> Does this patch fix the problem?
Yes, but that's not all...
BXT performance rose a lot in the affected programs at the end of last week (in some more than the drop, in some less). This was due to Chris' commit adding BXT & KBL PCI IDs to xf86-video-intel. I assume that with this, X will use SNA on BXT too, instead of the "default acceleration" that uses the (slow) legacy blitter.
So, I did a few tests:
* BXT: reverting the "Overhaul dri3_update_num_back" with the new X DDX didn't affect the performance (same good one)
* BXT: with old X DDX attached patch fixes the drops
* BXT: with new X DDX attached patch doesn't give additional improvement
* SNB: attached patch fixes the drops
- This isn't necessarily a Mesa issue, but an X DDX one
- Something is fishy with SNB, for which it would be good to get a comment from Chris
- Check modesetting driver perf before and after the commit
- Provide the requested info from GDB
* BXT: similar to new X DDX, perf OK (somewhat below X DDX)
* SNB: perf still bad, but better than with Intel DDX
-> on SNB the attached patch is needed to get performance back to the previous level, regardless of DDX.
Backtraces are from BXT, both with old (last month) X Intel DDX, before and after the commit. Earlier dri3_update_num_back() was called when getting buffers, now it's called when swapping.
Created attachment 126221 [details]
backtrace before change
Created attachment 126222 [details]
backtrace after change
Thanks for the information and testing! Fix pushed to master:
Author: Michel Dänzer <email@example.com>
Date: Tue Sep 6 11:34:49 2016 +0900
loader/dri3: Always use at least two back buffers
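Going only by the commit title, the fix amounts to putting a floor of two on the back buffer count. The sketch below is an assumption-based illustration, not the committed code: `sketch_num_back_fixed` and its parameters are hypothetical names, and the actual change in Mesa may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, NOT the committed Mesa code: same policy as
 * before (3 when flipping, 2 for a non-0 swap interval, else 1),
 * but clamped so that at least two back buffers are always used. */
static int
sketch_num_back_fixed(bool flipping, int swap_interval)
{
   int num_back = flipping ? 3 : (swap_interval != 0 ? 2 : 1);

   if (num_back < 2)
      num_back = 2;   /* the "at least two back buffers" floor */

   assert(num_back >= 2);
   return num_back;
}
```

With this floor, the non-flipping, swap interval 0 case gets two back buffers instead of one, which matches the case that attachment 126214 was reported to fix on SNB and on BXT with the old DDX.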
Verified. This also affected KBL (for which the Intel DDX still doesn't seem to use SNA) and potentially BYT too.