Bug 97549 - [SNB, BXT] up to 40% perf drop from "loader/dri3: Overhaul dri3_update_num_back" commit
Alias: None
Product: Mesa
Classification: Unclassified
Component: Other
Version: git
Hardware: x86-64 (AMD64) All
Priority: high
Severity: normal
Assignee: mesa-dev
QA Contact: mesa-dev
Depends on:
Reported: 2016-08-31 13:55 UTC by Eero Tamminen
Modified: 2016-09-07 13:02 UTC
CC: 2 users

See Also:
i915 platform:
i915 features:

Xorg log (18.05 KB, text/plain)
2016-09-02 12:42 UTC, Eero Tamminen
Always use at least two buffers (472 bytes, patch)
2016-09-05 08:27 UTC, Michel Dänzer
backtrace before change (3.64 KB, text/plain)
2016-09-05 15:28 UTC, Eero Tamminen
backtrace after change (2.71 KB, text/plain)
2016-09-05 15:28 UTC, Eero Tamminen

Description Eero Tamminen 2016-08-31 13:55:01 UTC
Following commit regresses performance hugely with DRI3 in synthetic benchmarks both on Sandybridge and Broxton.

commit 1e3218bc5ba2b739261f0c0bacf4eb662d377236
Author:     Michel Dänzer <michel.daenzer@amd.com>
AuthorDate: Wed Aug 17 17:02:04 2016 +0900
Commit:     Michel Dänzer <michel@daenzer.net>
CommitDate: Thu Aug 25 17:40:24 2016 +0900

    loader/dri3: Overhaul dri3_update_num_back
    Always use 3 buffers when flipping. With only 2 buffers, we have to wait
    for a flip to complete (which takes non-0 time even with asynchronous
    flips) before we can start working on the next frame. We were previously
    only using 2 buffers for flipping if the X server supports asynchronous
    flips, even when we're not using asynchronous flips. This could result
    in bad performance (the referenced bug report is an extreme case, where
    the inter-frame stalls were preventing the GPU from reaching its maximum
    clocks).

    I couldn't measure any performance boost using 4 buffers with flipping.
    Performance actually seemed to go down slightly, but that might have
    been just noise.
    Without flipping, a single back buffer is enough for swap interval 0,
    but we need to use 2 back buffers when the swap interval is non-0,
    otherwise we have to wait for the swap interval to pass before we can
    start working on the next frame. This condition was previously reversed.
    Cc: "12.0 11.2" <mesa-stable@lists.freedesktop.org>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=97260
    Reviewed-by: Frank Binns <frank.binns@imgtec.com>
    Reviewed-by: Eric Anholt <eric@anholt.net>
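The buffer-count policy described in the commit message can be modeled as follows. This is an editorial sketch, not the actual Mesa code; the function name `choose_num_back` and its parameters are assumptions made for illustration.

```c
#include <stdbool.h>

/* Hypothetical model of the policy from the commit message:
 * - flipping: always 3 buffers, so work on the next frame can start
 *   while a flip is still completing;
 * - no flipping, swap interval 0: a single back buffer is enough;
 * - no flipping, non-0 swap interval: 2 back buffers, so we need not
 *   wait out the swap interval before starting on the next frame.
 */
static int
choose_num_back(bool flipping, int swap_interval)
{
    if (flipping)
        return 3;
    return swap_interval == 0 ? 1 : 2;
}
```

The last branch is the condition the commit message says was previously reversed (2 buffers for interval 0, 1 for non-0).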

Reverting the commit restores earlier performance (bisect done on Broxton, revert tested on Sandybridge, so the same commit is the problem for both).

Impact is larger for tests with higher FPS, and naturally affects only onscreen versions of the tests.  Both fullscreen and windowed+composited tests were affected.

On Sandybridge impact is up to 35% (SynMark Batch tests), 25% in GpuTest Triangle test, and less in other tests.

On Broxton the drop affects more tests (due to the better GPU, heavier tests have higher FPS), even a few tests that are normally fully ALU bound:
* SynMark v6: up to 40% (Batch tests)
* GfxBench v4: 35% ALU, 25% Driver, 10% Tess tests
* Lightsmark 2008: 20%
* GpuTest 0.7: 15% Triangle, Julia32 & Plot3D tests
* GLB 2.7: 10% Egypt

The change doesn't seem to affect HSW, BDW, or SKL.  I don't know why.

The issue doesn't seem to be related to FPS (occasionally) being limited to some multiple of 60 FPS like in the earlier DRI3 perf bug.  My assumption is that the buffering change indirectly affects some memory setting, but I don't know which, as SNB & BXT differ in that respect:
* SNB has LLC, but BXT doesn't
* AFAIK Intel DDX supports SNA for SNB, but not yet for BXT
Comment 1 Michel Dänzer 2016-09-01 01:16:21 UTC
Please attach the corresponding Xorg log file.

It would also be interesting to know which path dri3_update_num_back takes before and after the change for each affected test.
Comment 2 Eero Tamminen 2016-09-02 12:42:38 UTC
Created attachment 126171 [details]
Xorg log

(In reply to Michel Dänzer from comment #1)
> Please attach the corresponding Xorg log file.

Attached; it didn't seem to have anything significant (the only differences from the version before the issue are timestamps, build IDs and some additional input device lines).
Comment 3 Michel Dänzer 2016-09-05 06:59:41 UTC
Does the problem also occur with the modesetting driver instead of intel?

If yes, I'm afraid we can't make progress on this bug without seeing at least the number of buffers used before and after my change in the affected cases, preferably also the values of the variables used to decide that number.
Comment 4 Michel Dänzer 2016-09-05 08:27:23 UTC
Created attachment 126214 [details] [review]
Always use at least two buffers

Does this patch fix the problem?
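(The attached 472-byte patch is not reproduced in this transcript. Based on its title and on the policy from the overhauled commit, a clamp of this shape is one plausible reading; the names below are hypothetical, not the actual patch.)

```c
#include <stdbool.h>

/* Sketch of "always use at least two buffers": keep the overhauled
 * buffer-count policy, but never drop below 2 back buffers.
 * Hypothetical standalone model, not the actual Mesa code. */
static int
choose_num_back_clamped(bool flipping, int swap_interval)
{
    int num_back;

    if (flipping)
        num_back = 3;
    else
        num_back = swap_interval == 0 ? 1 : 2;

    /* The proposed fix: a floor of two back buffers in all cases,
     * which removes the single-buffer path blamed for the stalls. */
    if (num_back < 2)
        num_back = 2;
    return num_back;
}
```

Under this reading, only the no-flip, swap-interval-0 case changes (1 becomes 2); the flipping and non-0-interval cases are unaffected.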
Comment 5 Eero Tamminen 2016-09-05 12:11:12 UTC
(In reply to Michel Dänzer from comment #4)
> Created attachment 126214 [details] [review] [review]
> Always use at least two buffers
> Does this patch fix the problem?

Yes, but that's not all...

BXT performance rose a lot in the affected programs at the end of last week (in some cases more than the drop, in some less).  This was due to Chris' commit adding BXT & KBL PCI IDs to xf86-video-intel. I assume that with this, X will be using SNA also on BXT instead of the "default acceleration" that uses the (slow) legacy blitter.

So, I did a few tests:
* BXT: reverting the "Overhaul dri3_update_num_back" with the new X DDX didn't affect the performance (same good one)
* BXT: with old X DDX attached patch fixes the drops
* BXT: with new X DDX attached patch doesn't give additional improvement
* SNB: attached patch fixes the drops

- This isn't necessarily a Mesa issue, but an X DDX one
- Something fishy with SNB for which it would be good to get comment from Chris

Still TODO:
- Check modesetting driver perf before and after the commit
- Provide requested info from GDB
Comment 6 Eero Tamminen 2016-09-05 15:27:46 UTC
With modesetting:

* BXT: similar to new X DDX, perf OK (somewhat below X DDX)
* SNB: perf still bad, but better than with Intel DDX

-> on SNB the attached patch is needed to get performance back to the previous level regardless of DDX.

Backtraces are from BXT, both with the old (last month's) Intel X DDX, before and after the commit.  Earlier dri3_update_num_back() was called when getting buffers; now it's called when swapping.
Comment 7 Eero Tamminen 2016-09-05 15:28:15 UTC
Created attachment 126221 [details]
backtrace before change
Comment 8 Eero Tamminen 2016-09-05 15:28:39 UTC
Created attachment 126222 [details]
backtrace after change
Comment 9 Michel Dänzer 2016-09-06 07:12:59 UTC
Thanks for the information and testing! Fix pushed to master:

Module: Mesa
Branch: master
Commit: dc3bb5db8c81e7f08ae12ea5d3ee999e2afcbfd1
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=dc3bb5db8c81e7f08ae12ea5d3ee999e2afcbfd1

Author: Michel Dänzer <michel.daenzer@amd.com>
Date:   Tue Sep  6 11:34:49 2016 +0900

loader/dri3: Always use at least two back buffers
Comment 10 Eero Tamminen 2016-09-07 13:02:17 UTC
Verified.  This also affected KBL (for which the Intel DDX still doesn't seem to use SNA) and potentially BYT too.
