Bug 87726

Summary:	[BDW] OglDrvCtx performance reduced by ~30% after use true PPGTT in Gen8+
Product:	DRI	Reporter:	wendy.wang
Component:	DRM/Intel	Assignee:	Chris Wilson <chris>
Status:	CLOSED FIXED	QA Contact:	Jairo Miramontes <jairo.daniel.miramontes.caton>
Severity:	critical
Priority:	high	CC:	christophe.prigent, eero.t.tamminen, intel-gfx-bugs
Version:	unspecified
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	BDW	i915 features:	GEM/PPGTT

Description wendy.wang 2014-12-26 01:21:55 UTC

Environment:
-----------------------------------
Platform: BDW

 Libdrm:		(master)libdrm-2.4.58-19-gf99522e678dbbaffeca9462a8edcbe900574dc12
 Mesa:		(master)873d7351c5a696b6f8493c907eb8c76b8cb632da
 Xserver:		(master)xorg-server-1.16.99.901-98-g0f5fdaf600bfeada966aea942cb3e347d4efee30
 Xf86_video_intel:		(master)2.99.916-177-g99537089346ed7a1bc6b02349fad93845e865bb9
 Cairo:		(master)4a225fca5f121c31ddaa0af80a13bf95a7f21a5b
 Libva:		(master)8986ec692b19d8dd6bd2aa118b5dffbd05a8f909
 Libva_intel_driver:		(master)b5d6d9d425a6d539b27d22992bda05f79d1a0622
 Kernel:   (drm-intel-nightly)8d22327ae006186bad9f95c300fad5ecd5e961a4

Bug detailed description:
---------------------------------------------
 OglDrvCtx performance reduced by ~30% after use true PPGTT in Gen8+

Bisect drm-intel-next-queued kernel branch, the first bad commit is:

2f82bbdf3d4f1361c3d713c516d8aa390102374d is the first bad commit
commit 2f82bbdf3d4f1361c3d713c516d8aa390102374d
Author: Michel Thierry <michel.thierry@intel.com>
Date: Mon Dec 15 14:58:00 2014 +0000

    drm/i915: Use true PPGTT in Gen8+ when execlists are enabled

    In Gen8+, full ppgtt needs execlist, otherwise the ctx switch can hang.

    Also remove the current restriction, a user should be able to explicitly set
    ppgtt=2.



Reproduce steps:
---------------------------------------------
1,   xinit &    
2,   cd /usr/local/games/OpenGL/SynMark2_6
3.   ./synmark2 OglDrvCtx

Comment 1 Chris Wilson 2015-01-06 10:48:29 UTC

It's expected, atm. Deferred pagetable allocation should improve matters, hopefully to the point where the VM creation overhead is negligible.

Comment 2 Jani Nikula 2015-10-23 09:42:22 UTC

(In reply to Chris Wilson from comment #1)
> It's expected, atm. Deferred pagetable allocation should improve matters,
> hopefully to the point where the VM creation overhead is negligible.

Chris, are we there yet?

Comment 3 Chris Wilson 2015-10-23 09:49:14 UTC

Patches have been posted at least once. :|

Comment 4 yann 2016-08-04 10:24:51 UTC

old regression issue with no update for around a year and with a fix that went upstream, so closing. If this is still an issue with current drm-intel-nightly please reopen.

Comment 5 Chris Wilson 2016-08-04 10:27:17 UTC

The regression has not been fixed yet.

Comment 6 yann 2016-08-04 10:32:35 UTC

Chris, do we have a plan to solve it? Do you need any help from me or my team?

Comment 7 Chris Wilson 2016-08-04 10:58:04 UTC

There are a number of patches to address this, with review outstanding. The regression here is due to the large number of context and list entries generated by OglDrvCtx. The patches try to compensate by making context creation quicker (the goal of that particular microbenchmark) and introduce hashtables to avoid the linear list walks.

Comment 8 Chris Wilson 2016-11-01 13:44:45 UTC

One step forward, several back in the meantime:

commit db6c2b4151f2915fe1695cdcac43b32e73d1ad32
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Nov 1 11:54:00 2016 +0000

    drm/i915: Store the vma in an rbtree under the object
    
    With full-ppgtt one of the main bottlenecks is the lookup of the VMA
    underneath the object. For execbuf there is merit in having a very fast
    direct lookup of ctx:handle to the vma using a hashtree, but that still
    leaves a large number of other lookups. One way to speed up the lookup
    would be to use a rhashtable, but that requires extra allocations and
    may exhibit poor worse case behaviour. An alternative is to use an
    embedded rbtree, i.e. no extra allocations and deterministic behaviour,
    but at the slight cost of O(lgN) lookups (instead of O(1) for
    rhashtable). The major of such tree will be very shallow and so not much
    slower, and still scales much, much better than the current unsorted
    list.

Comment 9 Chris Wilson 2017-01-23 09:03:11 UTC

Fwiw, we are about half-way through the outstanding patches to fix (as far as we can) in the driver. There will be some remaining overhead from switching pagetables in the hw that we cannot hide.

Comment 10 Eero Tamminen 2017-02-17 09:11:08 UTC

(In reply to Chris Wilson from comment #7)
> There are a number of patches to address this, with review outstanding. The
> regression here is due to the large number of context and list entries
> generated by OglDrvCtx. The patches try to compensate by making context
> creation quicker (the goal of that particular microbenchmark) and introduce
> hashtables to avoid the linear list walks.

Note: we stopped tracking that (SynMark v6.0) test last year when switching to  SynMark v7.0 which didn't anymore include it.  For perf test it had unreasonable amount of variance and it was acting more like a functional test-case for kernel & compositor (particularly Unity/compiz), bugs that would IMHO be better tracked by a specific, functional test-cases e.g. in IGT.

Comment 11 Chris Wilson 2017-02-17 09:20:43 UTC

From recollection, OglDrvCtx had perf regressions (and distinct rate-limiting steps) in mesa, dri[2,3] and the kernel. It was basically a glClear benchmark across lots of contexts, which in many respects shows the same characteristics as lots of clients. That is a level of integration testing above igt. Showing the scaling characteristics of the kernel wrt to the number of contexts (be they one fd or many) does have coverage in igt.

Comment 12 Eero Tamminen 2017-02-17 10:22:50 UTC

(In reply to Chris Wilson from comment #11)
> From recollection, OglDrvCtx had perf regressions (and distinct
> rate-limiting steps) in mesa, dri[2,3] and the kernel.

And when run for longer time, also in Unity/compiz (Ubuntu bug tracker has bug for per-window resource leak which was opened many years ago and still unfixed).


> It was basically a glClear benchmark across lots of contexts,
> which in many respects shows the same characteristics as lots of clients.

It creates a context, compiles a most trivial shader program, does color buffer clear, and draws single triangle, swaps it on screen and deletes shaders & context, on every frame. There's only single context live at any given time i.e. it simulates lot of successive (windowed & composited) clients, not parallel ones.

Because it's most trivial use-case possible, and window is fairly small, it's only partly memory bandwidth (clear) bound.


> That is a level of integration testing above igt.
> Showing the scaling characteristics of the kernel wrt to
> the number of contexts (be they one fd or many) does have coverage in igt.

If you don't think relevant parts of that test couldn't be added to igt, but you think it's still useful, I think there should be some other open source implementation for that kind of "stress" test (it should not be left for old version of Intel *internal* test-suite).

PS. Something I forgot to say earlier.  As we aren't anymore tracking this, I don't know who will verify the fix.

Comment 13 Jari Tahvanainen 2017-04-18 12:01:22 UTC

Priority changed to High+Critical.

Comment 14 Ricardo 2017-05-09 17:31:59 UTC

Adding tag into "Whiteboard" field - ReadyForDev
The bug still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included

Comment 15 Jari Tahvanainen 2017-05-16 07:43:58 UTC

I know Chris that you say that we are half a way to fixing the regression, but measurements with OglDrvCtx (from Synmark 7) done with Ubuntu kernels are showing that one might have recovered from the regression reported dec-2014. If one sets Aug-2015+fixes as the baseline (100%) then one :
4.2.0-42-generic - Aug-2015+fixes - 100%
4.4.0-21-generic - Jan-2016+fixes - 187%
4.8.0-46-generic - Oct-2016+fixes - 173%
4.10.0-19-generic - Feb-2017+fixes - 206%
So I will remove regression related tags from this issue, and proposing this to be closed as fixed. Chris+Jani+Eero - if you agree then please change status to closed, if not then please set status=REOPENED.

Comment 16 Chris Wilson 2017-05-16 08:06:55 UTC

You would need to measure current mesa with execlists=0 and ppgtt=1 to get the current baseline. 

Baseline: i915.enable_execlists=0 i915.enable_ppgtt=1
# Not sure if i915.enable_execlists=0 i915.enable_ppgtt=2 is functional enough to # test
1: i915.enable_execlists=1 i915.enable_ppgtt=1
2: i915.enable_execlists=1 i915.enable_ppgtt=2

The biggest improvement in the kernel for this bug was from using an rbtree for vma lookup, and improving the mechanisms for writing the GTT. (Almost) Everything I had planned for this bug (since ~30 months ago) is almost in the kernel, the last dregs can be found here: https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=prescheduler (down to 37 patches!!!).

The remaining features are O(1) lookup for vma from a context (i.e. excellent scaling of execbuf to very large number of contexts), context state caching.

Comment 17 Eero Tamminen 2017-05-16 08:13:05 UTC

(In reply to Jari Tahvanainen from comment #15)
> I know Chris that you say that we are half a way to fixing the regression,
> but measurements with OglDrvCtx (from Synmark 7) done with Ubuntu kernels

OglDrvCtx was dropped from SynMark 7, it's only available in earlier versions.

Comment 18 Chris Wilson 2017-06-16 16:21:14 UTC

Next stage of full-ppgtt scalability:

commit 4ff4b44cbb70c269259958cbcc48d7b8a2cb9ec8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 16 15:05:16 2017 +0100

    drm/i915: Store a direct lookup from object handle to vma

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.