Created attachment 111341 [details]
dmesg file

Environment:
-----------------------------------
Platform: BDW
Libdrm: (master)libdrm-2.4.58-19-gf99522e678dbbaffeca9462a8edcbe900574dc12
Mesa: (master)ff9653775962ab661c4891721b1b93d077d1f2db
Xserver: (master)xorg-server-1.16.99.901-85-g91651e7c15892aa846fc406fbb13b37f094dd3f0
Xf86_video_intel: (master)2.99.916-172-g04a09d353fc5ad8a55eb2457dc8bb43638ba879e
Cairo: (master)4a225fca5f121c31ddaa0af80a13bf95a7f21a5b
Libva: (master)cdfd3d50d80c092ca3ae8914418be628d5a80832
Libva_intel_driver: (master)f2a34f94c57e1f7ce975b2068fb087df84b92e3a
Kernel: (drm-intel-nightly)a243fbbcfd348ec042b100f3ce02f215252d177d

Bug detailed description:
---------------------------------------------
OglBatch7 performance is reduced by ~7% after enabling execlists.

Bisecting the drm-intel-next-queued kernel branch shows the first bad commit is:

commit 27401d126b5b1c8dd4df98bbb60b09ff2b3d5e60
Author: Thomas Daniel <thomas.daniel@intel.com>
Date:   Thu Dec 11 12:48:35 2014 +0000

    drm/i915/bdw: Enable execlists by default where supported

    Execlist support in the i915 driver is now considered good enough for the
    feature to be enabled by default on Gen8 and later and routinely tested.
    Adjusted i915 parameters structure initialization to reflect this and
    updated the comment in intel_sanitize_enable_execlists().

    There's still work to do before we can let the wider massive onto it, but
    there's still time left before the 3.20 cutoff.

    v2: Update the MODULE_PARM_DESC too.

    Issue: VIZ-2020
    Signed-off-by: Thomas Daniel <thomas.daniel@intel.com>
    [danvet: Add note that there's still some work left to do.]
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Reproduce steps:
---------------------------------------------
1. xinit &
2. cd /usr/local/games/OpenGL/SynMark2_6
3. ./synmark2 OglBatch7

The dmesg file has been attached.
The Batch7 test does an (unrealistically) large number of batches within a frame, so it's completely CPU bound. It's something of a worst case for this. Wendy, are there any tests which improved with this commit?
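For readers unfamiliar with the test: SynMark is closed source, so the following is a purely illustrative sketch of what a Batch-style workload looks like (the draw-call count and all structure are assumptions, not taken from SynMark). A trivial scene is split into very many tiny draw calls, so per-draw CPU and submission overhead dominates and the GPU is never kept busy:

/*
 * Hypothetical sketch of a "Batch" style GL test, NOT the actual
 * SynMark source. One tiny triangle per draw call: the GPU finishes
 * each batch long before the CPU has built the next one.
 *
 * Build (assumed): gcc batch_sketch.c -lglut -lGL -o batch_sketch
 */
#include <GL/glut.h>

#define DRAW_CALLS 16384   /* assumption; the real count is unknown */

static const GLfloat tri[] = { -0.01f, -0.01f, 0.01f, -0.01f, 0.0f, 0.01f };

static void display(void)
{
    glClear(GL_COLOR_BUFFER_BIT);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(2, GL_FLOAT, 0, tri);

    /* Many minimal draws per frame. (The real test presumably also
     * forces enough state changes/flushes that draws reach the kernel
     * as separate batches; this only shows the draw-call pattern.) */
    for (int i = 0; i < DRAW_CALLS; i++)
        glDrawArrays(GL_TRIANGLES, 0, 3);

    glutSwapBuffers();
    glutPostRedisplay();   /* re-render continuously, as a benchmark does */
}

int main(int argc, char **argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGB);
    glutCreateWindow("batch-overhead sketch");
    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}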
I did not observe performance improving in any other cases on BDW GT2 (E2 or F0).
Wendy or Eero, do we have a profile for this case? If the test is CPU bound, does that mean our new execlist submission path has a lot more overhead than our ring buffer one?
(In reply to Jesse Barnes from comment #3)
> Wendy or Eero, do we have a profile for this case? If the test is CPU
> bound, does that mean our new execlist submission path has a lot more
> overhead than our ring buffer one?

Sorry Jesse, I do not have the profile for the OglBatch7 case.
Sounds like this is wontfix.
Why not? It's the extra interrupt overhead in the driver, which we can do our best to reduce. In the ideal case, execlists only increases CPU overhead and does not introduce GPU stalls like the ones here.
I don't understand why we wouldn't be interested in fixing the inherent issues with execlists.
Looking at my bdw-u system, the effect here seems to be attributable to the increased latency of execlist submission to an idle GPU compared to legacy - and that we are doing just enough work inside the execlist handler for the GPU to become idle between batches. On my machine I can get comparable timings between execlists and legacy by tuning the driver, but if my hypothesis is valid such tuning will be highly dependent upon the system.
So are we not submitting work to the second submit port? If we do, and we re-submit it on the first port, I'm assuming it won't preempt it, dump the state, and then re-load it when we submit new work to the second port. With some more detail and experimentation hopefully we can get this down to 0% or at least in the noise.
There isn't enough work to submit pairs of batches - between synmark, mesa and the driver, we spend longer creating the batch for execution than it takes the GPU to execute it. This test really does focus on submission overhead. I think I have it mostly down to the noise now, or at least down to the level where even I won't whinge too much. (There is definitely room for improvement wrt the hardware responding to ELSP, so hopefully it will be less of an issue in future gens.)
Wendy provided me data with execlists enabled (default) and disabled (i915.enable_execlists=0 kernel command line option), for full SynMark test-set with recent 3D stack.

Execlists is in no case faster. It's ~50% slower in GL context re-creation test, and in other CPU bound tests it's 5-10% slower:
- Batch5-7 (batching huge triangle strip with ever increasing number of draw calls)
- DrvState (mostly CPU bound GL state change test)
- Multithread (GL calls from multiple threads)

Same difference on both BSW & BDW.
(In reply to Eero Tamminen from comment #11)
> Wendy provided me data with execlists enabled (default) and disabled
> (i915.enable_execlists=0 kernel command line option), for full SynMark
> test-set with recent 3D stack.
>
> Execlists is in no case faster.
>
> It's ~50% slower in GL context re-creation test, and in other CPU bound

Hmm, I've not looked at that. The cause is fairly obvious, since execlists implies even more state to create per context. I think I can safely eliminate most of the extra overhead. Worth filing that as a separate bug (I guess the report got overlooked due to the first regression from enabling ppgtt).

> tests it's 5-10% slower:
> - Batch5-7 (batching huge triangle strip with ever increasing number of draw
> calls)

These are what I've been looking at. There is a lot of low-hanging fruit we can eliminate in execlist submission that brings us back on par. However, looking at the profile of igt/gem_exec_nop and igt/gem_exec_blt, I came to the conclusion that the largest effect is when the GPU is idle (i.e. the GPU takes longer to respond to ELSP than to a TAIL update). Therefore the biggest impact of execlists will differ between machines, depending on just how frequently the GPU stalls waiting for the driver.

> - DrvState (mostly CPU bound GL state change test)
> - Multithread (GL calls from multiple threads)

I presume these will show more or less the same execlist submission overheads as above, except that we will be exercising context switching (and thus ring switching) much more. In the driver we can amalgamate consecutive batches from the same context into a single submission; with more contexts active we will not only have to do more ELSP submissions (i.e. more irq handling and general driver overheads) but also incur more latency from hw context switches.
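A minimal sketch of the amalgamation idea mentioned above - hypothetical code, loosely modelled on what an execlists dequeue could do, not the actual i915 implementation. Consecutive pending requests from the same context collapse into a single ELSP port entry, so a run of N batches from one GL context costs one ELSP write instead of N (and avoids N hardware context switches):

/* Hypothetical sketch, not i915 code: merge consecutive same-context
 * requests from a pending FIFO into the two ELSP ports. */
struct context { int id; };

struct request {
    struct context *ctx;
    struct request *next;   /* singly-linked FIFO of pending requests */
    unsigned int tail;      /* ring tail after this request's commands */
};

struct port { struct context *ctx; unsigned int tail; };

static int dequeue(struct request **queue, struct port elsp[2])
{
    int used = 0;

    while (*queue) {
        struct request *rq = *queue;

        if (used && elsp[used - 1].ctx == rq->ctx) {
            /* Same context as the last port entry: just extend its
             * tail; no extra ELSP slot or context switch needed. */
            elsp[used - 1].tail = rq->tail;
        } else if (used < 2) {
            /* A new context claims the next port. */
            elsp[used].ctx = rq->ctx;
            elsp[used].tail = rq->tail;
            used++;
        } else {
            break;  /* both ports occupied by other contexts */
        }
        *queue = rq->next;
    }
    return used;    /* caller would then write elsp[] to the hardware */
}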
Do we have any update on this old regression? It looks like it may require some non-negligible work.
It's more or less a fundamental difference in hw implementation between execlists and legacy submission. For whatever reason, execution latency from idle is much worse under execlists, and OglBatch7 hits that path continuously (since the overhead in the driver is sufficient that mesa cannot keep the GPU busy). However, some of this is self inflicted as execlists incurs greater execution overhead, both from per-bb/per-ctx instructions and from being irq driven. Those we can improve and have been improving, but that execution latency is hard to hide.
Reference to Chris' patch set: https://patchwork.freedesktop.org/series/21821/
To put the difference into perspective, a couple of microbenchmarks (bdw i7-5557u):

                                        ring    execlists
exec continuous nops on all rings:   1.491us      2.223us
exec sequential nops on each ring:  12.508us     53.682us
single nop + sync:                   9.272us     30.291us
vblank_mode=0 glxgears:            ~11000fps     ~9000fps
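As a quick back-of-the-envelope check (my arithmetic, not part of the original comment), the per-frame cost execlists adds in glxgears lines up with the extra latency of a single synchronous nop, supporting the idea that this is per-submission latency rather than throughput:

#include <stdio.h>

int main(void)
{
    /* Numbers from the table above. */
    double ring_fps = 11000.0, execlists_fps = 9000.0;
    double ring_sync_us = 9.272, execlists_sync_us = 30.291;

    /* Extra frame time under execlists vs. extra sync latency. */
    double frame_delta_us = (1e6 / execlists_fps) - (1e6 / ring_fps);
    double sync_delta_us  = execlists_sync_us - ring_sync_us;

    printf("extra us/frame (glxgears): %.1f\n", frame_delta_us); /* ~20.2 */
    printf("extra us/sync  (nop):      %.1f\n", sync_delta_us);  /* ~21.0 */
    return 0;
}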
Priority changed to High+Critical.
Adding tag into "Whiteboard" field - ReadyForDev

The bug is still active:
* Status is correct
* Platform is included
* Feature is included
* Priority and Severity correctly set
* Logs included
I removed the regression-related tags from this issue. Based on comment 16 and measurements done with OglBatch7 on Ubuntu kernels, one can state that the gap caused by the change from ring to execlists still exists. If one sets Aug-2015+fixes as the baseline (100%), then:

4.2.0-42-generic  - Aug-2015+fixes - 100%
4.10.0-19-generic - Feb-2017+fixes ~  80%

This is in line with the glxgears numbers presented by Chris in comment 16 (~9000fps / ~11000fps ≈ 82%).
Hello everyone, this seems to need more time to make progress, so I'll change the status to NEEDINFO until there is a new update on this case. Once updated, please change it to REOPEN. Thank you.
Minor improvements, but still the disparity remains. However,

commit 79e6770cb1f5e32eb49590edbb794a97d0113aed
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 20 20:55:01 2017 +0000

    drm/i915: Remove obsolete ringbuffer emission for gen8+

    Since removing the module parameter to force selection of ringbuffer
    emission for gen8, the code is defunct. Remove it.

    To put the difference into perspective, a couple of microbenchmarks
    (bdw i7-5557u, 20170324):

                                            ring    execlists
    exec continuous nops on all rings:   1.491us      2.223us
    exec sequential nops on each ring:  12.508us     53.682us
    single nop + sync:                   9.272us     30.291us
    vblank_mode=0 glxgears:            ~11000fps     ~9000fps

    Since the earlier submission, gen8 ringbuffer submission has fallen
    further and further behind in features. So while ringbuffer may hold
    the throughput crown, in terms of interactive latency, execlists is
    much better. Alas, we have no convenient metrics for such, other than
    demonstrating things we can do with execlists but can not using
    legacy ringbuffer submission.

    We have made a few improvements to lowlevel execlists throughput, and
    ringbuffer currently panics on boot! (bdw i7-5557u, 20171026):

                                            ring    execlists
    exec continuous nops on all rings:       n/a      1.921us
    exec sequential nops on each ring:       n/a     44.621us
    single nop + sync:                       n/a     21.953us
    vblank_mode=0 glxgears:                  n/a    ~18500fps

    References: https://bugs.freedesktop.org/show_bug.cgi?id=87725
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Once-upon-a-time-Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171120205504.21892-2-chris@chris-wilson.co.uk

commit fb5c551ad510e4a408c105670f89c725ebbfe6c8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 20 20:55:00 2017 +0000

    drm/i915: Remove i915.enable_execlists module parameter

    Execlists and legacy ringbuffer submission are no longer feature
    comparable (execlists now offer greater functionality that should
    overcome their performance hit) and obsoletes the unsafe module
    parameter, i.e. comparing the two modes of execution is no longer
    useful, so remove the debug tool.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Lionel Landwerlin <lionel.g.landwerlin@intel.com> #i915_perf.c
    Link: https://patchwork.freedesktop.org/patch/msgid/20171120205504.21892-1-chris@chris-wilson.co.uk
I wonder what impact resource streamer would have on this? (This is not a request to add RS, as it goes away in newer GENs, but I know that Windows uses it, so I was just wondering how much impact you think it would have had.)
(In reply to Eero Tamminen from comment #22)
> I wonder what impact resource streamer would have on this?
>
> (This is not a request to add RS, as it goes away in newer GENs, but I know
> that Windows uses it, so I was just wondering how much impact you think it
> would have had.)

Overall, none. RS afaik is the same for both submission paths; the effect causing the slowdown is the order-of-magnitude increase in latency, both in the initial submission and in effectively doing a context switch every time (waking up, reprogramming ELSP, which in turn triggers another wakeup a few microseconds later; repeat until steady state is achieved, then start again at the end of the batch).
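To make the mechanism described above concrete, here is a simplified pseudo-driver sketch. All types and register accessors are hypothetical stand-ins, not the i915 API: legacy submission is a single tail write that the GPU polls, while execlists needs an ELSP write plus an interrupt round trip back into the driver for every port refill:

#include <stdbool.h>

struct port { int ctx; unsigned int tail; };

struct engine {
    int pending;             /* requests waiting in the driver */
    int ports_in_flight;     /* ELSP entries the GPU is working on */
};

/* Hypothetical MMIO stubs. */
static void write_tail_register(struct engine *e, unsigned int tail) { (void)e; (void)tail; }
static void write_elsp_register(struct engine *e, struct port *p) { (void)e; (void)p; }
static bool read_context_switch_event(struct engine *e) { (void)e; return false; }

/* Legacy ringbuffer: the GPU polls the ring, so giving it more work is
 * just advancing the tail pointer - no interrupt round trip. */
static void legacy_submit(struct engine *e, unsigned int tail)
{
    write_tail_register(e, tail);
}

/* Execlists: fill up to two ports and write them to ELSP; the hardware
 * will not take more until it reports a context switch back to us. */
static void execlists_submit(struct engine *e)
{
    struct port elsp[2];
    int n = 0;

    /* Fill up to two ports from the pending queue (merging of
     * same-context requests omitted here - see the earlier sketch). */
    while (n < 2 && e->pending) {
        elsp[n].ctx = n + 1;       /* placeholder context id */
        elsp[n].tail = 0;          /* placeholder ring tail */
        e->pending--;
        e->ports_in_flight++;
        n++;
    }

    /* Ports are written high element first; the final write triggers
     * the hardware to load the contexts. */
    while (n--)
        write_elsp_register(e, &elsp[n]);
}

/* Every completed or switched-out context raises an interrupt that
 * brings us back here to refill the ports: wake up, read the event,
 * reprogram ELSP, which triggers another wakeup a few microseconds
 * later - the "repeat until steady state" cost described above. */
static void execlists_irq_handler(struct engine *e)
{
    while (read_context_switch_event(e))
        e->ports_in_flight--;

    if (e->pending)
        execlists_submit(e);
}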