Summary: | [BDW] OglDrvCtx performance reduced by ~30% after use true PPGTT in Gen8+ | ||
---|---|---|---|
Product: | DRI | Reporter: | wendy.wang |
Component: | DRM/Intel | Assignee: | Chris Wilson <chris> |
Status: | CLOSED FIXED | QA Contact: | Jairo Miramontes <jairo.daniel.miramontes.caton> |
Severity: | critical | ||
Priority: | high | CC: | christophe.prigent, eero.t.tamminen, intel-gfx-bugs |
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | BDW | i915 features: | GEM/PPGTT |
Description
wendy.wang
2014-12-26 01:21:55 UTC
It's expected, atm. Deferred pagetable allocation should improve matters, hopefully to the point where the VM creation overhead is negligible.

(In reply to Chris Wilson from comment #1)
> It's expected, atm. Deferred pagetable allocation should improve matters,
> hopefully to the point where the VM creation overhead is negligible.

Chris, are we there yet?

Patches have been posted at least once. :|

old regression issue with no update for around a year and with a fix that went upstream, so closing. If this is still an issue with current drm-intel-nightly please reopen.

The regression has not been fixed yet. Chris, do we have a plan to solve it? Do you need any help from me or my team?

There are a number of patches to address this, with review outstanding. The regression here is due to the large number of context and list entries generated by OglDrvCtx. The patches try to compensate by making context creation quicker (the goal of that particular microbenchmark) and introduce hashtables to avoid the linear list walks.

One step forward, several back in the meantime:

commit db6c2b4151f2915fe1695cdcac43b32e73d1ad32
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Nov 1 11:54:00 2016 +0000

    drm/i915: Store the vma in an rbtree under the object

    With full-ppgtt one of the main bottlenecks is the lookup of the VMA underneath the object. For execbuf there is merit in having a very fast direct lookup of ctx:handle to the vma using a hashtree, but that still leaves a large number of other lookups. One way to speed up the lookup would be to use a rhashtable, but that requires extra allocations and may exhibit poor worse case behaviour. An alternative is to use an embedded rbtree, i.e. no extra allocations and deterministic behaviour, but at the slight cost of O(lgN) lookups (instead of O(1) for rhashtable). The major of such tree will be very shallow and so not much slower, and still scales much, much better than the current unsorted list.
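The scaling argument in the commit message above (an unsorted list walk versus an O(lgN) ordered lookup) can be illustrated with a small sketch. This is not i915 code: a binary search over a sorted key array stands in for the rbtree, and the integer `vm_id` keys, the `"vma-N"` values and all function names are hypothetical. It only counts how many entries each approach has to inspect.

```python
def linear_lookup(entries, vm_id):
    """Walk an unsorted list of (key, vma) pairs; return (vma, comparisons)."""
    comparisons = 0
    for key, vma in entries:
        comparisons += 1
        if key == vm_id:
            return vma, comparisons
    return None, comparisons


def ordered_lookup(keys, values, vm_id):
    """Binary search over sorted keys (a stand-in for a balanced tree walk).

    Returns (vma, probes), where probes is the number of keys inspected.
    """
    lo, hi = 0, len(keys)
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1
        if keys[mid] == vm_id:
            return values[mid], probes
        if keys[mid] < vm_id:
            lo = mid + 1
        else:
            hi = mid
    return None, probes


def compare(n):
    """Look up the last of n hypothetical VMAs with both strategies."""
    keys = list(range(n))
    values = [f"vma-{k}" for k in keys]
    entries = list(zip(keys, values))
    _, walked = linear_lookup(entries, n - 1)
    _, probed = ordered_lookup(keys, values, n - 1)
    return walked, probed


if __name__ == "__main__":
    for n in (16, 1024, 65536):
        walked, probed = compare(n)
        print(f"{n} entries: list walk {walked} comparisons, ordered lookup {probed} probes")
```

With many contexts per object the list walk grows linearly with the number of entries, while the ordered lookup grows only with the depth of the search, which is the behaviour the commit message relies on.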
Fwiw, we are about half-way through the outstanding patches to fix this (as far as we can) in the driver. There will be some remaining overhead from switching pagetables in the hw that we cannot hide.

(In reply to Chris Wilson from comment #7)
> There are a number of patches to address this, with review outstanding. The
> regression here is due to the large number of context and list entries
> generated by OglDrvCtx. The patches try to compensate by making context
> creation quicker (the goal of that particular microbenchmark) and introduce
> hashtables to avoid the linear list walks.

Note: we stopped tracking that (SynMark v6.0) test last year when switching to SynMark v7.0, which no longer includes it. As a perf test it had an unreasonable amount of variance, and it was acting more like a functional test-case for the kernel & compositor (particularly Unity/compiz), finding bugs that would IMHO be better tracked by specific, functional test-cases, e.g. in IGT.

From recollection, OglDrvCtx had perf regressions (and distinct rate-limiting steps) in mesa, dri[2,3] and the kernel. It was basically a glClear benchmark across lots of contexts, which in many respects shows the same characteristics as lots of clients. That is a level of integration testing above igt. Showing the scaling characteristics of the kernel wrt the number of contexts (be they one fd or many) does have coverage in igt.

(In reply to Chris Wilson from comment #11)
> From recollection, OglDrvCtx had perf regressions (and distinct
> rate-limiting steps) in mesa, dri[2,3] and the kernel.

And when run for a longer time, also in Unity/compiz (the Ubuntu bug tracker has a bug for a per-window resource leak which was opened many years ago and is still unfixed).

> It was basically a glClear benchmark across lots of contexts,
> which in many respects shows the same characteristics as lots of clients.
It creates a context, compiles a most trivial shader program, does a color buffer clear, draws a single triangle, swaps it on screen and deletes the shaders & context, on every frame. There's only a single context live at any given time, i.e. it simulates a lot of successive (windowed & composited) clients, not parallel ones. Because it's the most trivial use-case possible, and the window is fairly small, it's only partly memory bandwidth (clear) bound.

> That is a level of integration testing above igt.
> Showing the scaling characteristics of the kernel wrt to
> the number of contexts (be they one fd or many) does have coverage in igt.

If you don't think the relevant parts of that test could be added to igt, but you think it's still useful, I think there should be some other open source implementation of that kind of "stress" test (it should not be left to an old version of an Intel *internal* test-suite).

PS. Something I forgot to say earlier: as we are no longer tracking this, I don't know who will verify the fix.

Priority changed to High+Critical.

Adding tag into "Whiteboard" field - ReadyForDev
The bug still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included

I know Chris that you say that we are half-way to fixing the regression, but measurements with OglDrvCtx (from Synmark 7) done with Ubuntu kernels are showing that one might have recovered from the regression reported Dec-2014. If one sets Aug-2015+fixes as the baseline (100%) then one gets:

4.2.0-42-generic - Aug-2015+fixes - 100%
4.4.0-21-generic - Jan-2016+fixes - 187%
4.8.0-46-generic - Oct-2016+fixes - 173%
4.10.0-19-generic - Feb-2017+fixes - 206%

So I will remove the regression related tags from this issue, and propose that it be closed as fixed. Chris+Jani+Eero - if you agree then please change status to closed, if not then please set status=REOPENED.

You would need to measure current mesa with execlists=0 and ppgtt=1 to get the current baseline.
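To read the kernel-to-kernel movement out of the relative scores quoted above, a small sketch like the following may help. It uses only the four percentages from the comment (4.2.0-42 as 100%); the function name `step_changes` is hypothetical, and the raw benchmark numbers are not available in this thread, so nothing beyond these ratios can be derived.

```python
# Relative OglDrvCtx scores as quoted in the thread (4.2.0-42-generic = 100).
SCORES = [
    ("4.2.0-42-generic", 100),
    ("4.4.0-21-generic", 187),
    ("4.8.0-46-generic", 173),
    ("4.10.0-19-generic", 206),
]


def step_changes(results):
    """Percentage change of each kernel relative to the one listed before it."""
    changes = []
    for (_, prev), (name, cur) in zip(results, results[1:]):
        changes.append((name, round((cur - prev) / prev * 100, 1)))
    return changes


if __name__ == "__main__":
    for name, change in step_changes(SCORES):
        print(f"{name}: {change:+.1f}% vs previous kernel")
```

On these figures the 4.8 kernel sits roughly 7.5% below 4.4 before 4.10 moves ahead again, so the recovery is not monotonic even though every later kernel is well above the Aug-2015 baseline.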
Baseline: i915.enable_execlists=0 i915.enable_ppgtt=1
# Not sure if i915.enable_execlists=0 i915.enable_ppgtt=2 is functional enough to test
1: i915.enable_execlists=1 i915.enable_ppgtt=1
2: i915.enable_execlists=1 i915.enable_ppgtt=2

The biggest improvement in the kernel for this bug was from using an rbtree for vma lookup, and improving the mechanisms for writing the GTT.

(Almost) Everything I had planned for this bug (since ~30 months ago) is almost in the kernel, the last dregs can be found here: https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=prescheduler (down to 37 patches!!!). The remaining features are O(1) lookup for vma from a context (i.e. excellent scaling of execbuf to very large number of contexts), context state caching.

(In reply to Jari Tahvanainen from comment #15)
> I know Chris that you say that we are half a way to fixing the regression,
> but measurements with OglDrvCtx (from Synmark 7) done with Ubuntu kernels

OglDrvCtx was dropped from SynMark 7, it's only available in earlier versions.

Next stage of full-ppgtt scalability:

commit 4ff4b44cbb70c269259958cbcc48d7b8a2cb9ec8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Jun 16 15:05:16 2017 +0100

    drm/i915: Store a direct lookup from object handle to vma
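For anyone re-running the comparison, the baseline and the two test configurations listed earlier in this comment can be written out as kernel command-line fragments. The sketch below only formats the `i915.enable_execlists` and `i915.enable_ppgtt` values exactly as given in the thread; the `CONFIGS` table and the `cmdline` helper are hypothetical names, and whether these module parameters exist on a given kernel version is an assumption to check first.

```python
# Parameter values copied from the thread; the labels match its numbering.
CONFIGS = {
    "baseline": {"enable_execlists": 0, "enable_ppgtt": 1},
    "1": {"enable_execlists": 1, "enable_ppgtt": 1},
    "2": {"enable_execlists": 1, "enable_ppgtt": 2},
}


def cmdline(params, module="i915"):
    """Format module parameters as a kernel command-line fragment."""
    return " ".join(f"{module}.{name}={value}" for name, value in params.items())


if __name__ == "__main__":
    for label, params in CONFIGS.items():
        print(f"{label}: {cmdline(params)}")
```

Each run would then be measured with the same mesa build, so that only the kernel-side execlists and ppgtt settings differ between results.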