Bug 111090 - 3% perf drop in GfxBench Manhattan 3.0, 3.1 and CarChase test-cases
Summary: 3% perf drop in GfxBench Manhattan 3.0, 3.1 and CarChase test-cases
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2019-07-08 15:59 UTC by Eero Tamminen
Modified: 2019-08-15 12:31 UTC

See Also:
i915 platform: BXT, KBL
i915 features:


Description Eero Tamminen 2019-07-08 15:59:04 UTC
Between following drm-tip commits:
* e8f06c34fa: 2019y-05m-27d-14h-41m-23s UTC integration manifest
* 8991a80f85: 2019y-05m-28d-15h-47m-22s UTC integration manifest

There was a performance drop in GfxBench Manhattan 3.0 (gl_manhattan) & 3.1 (gl_manhattan31) and CarChase (gl_4) test-cases:
* testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_manhattan

This drop is visible both with onscreen and offscreen tests, both with X server / Unity and Weston.

The performance drop is clearest with Weston / GLES on BXT, regardless of whether the test-case runs under Xwayland or is Wayland native; there it's close to 3%.

With X server / GL, the perf drop is visible only in Manhattan 3.0 on BXT, and marginally in Manhattan 3.1 on KBL GT3e.

No test-cases have improved their performance in this same time period.

This regression isn't visible on BDW GT2 or SKL GT2 (I have data only from these 5 machines).

Looking at the iowait and RAPL data, at least the Manhattan 3.0 test-case is now a bit more GPU (IO) bound, and the GPU (uncore) uses clearly less power.  I.e. it seems that since the end of May, the kernel doesn't let Mesa utilize the GPU as fully as before.
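
For reference, comparisons like the uncore power one above can be scripted against the Linux powercap sysfs interface; the sketch below computes average RAPL power between two energy_uj readings. The domain name "intel-rapl:0:1" for the uncore is an assumption and varies per platform (check each domain's "name" file):

```python
def rapl_power_watts(energy_uj_start, energy_uj_end, seconds, max_energy_uj=2**32):
    """Average power in watts between two RAPL energy_uj samples,
    compensating for at most one counter wraparound."""
    delta_uj = energy_uj_end - energy_uj_start
    if delta_uj < 0:  # counter wrapped between the two samples
        delta_uj += max_energy_uj
    return delta_uj / seconds / 1e6

def read_energy_uj(domain="intel-rapl:0:1"):
    """Read the cumulative energy counter (microjoules) for a powercap
    domain; ':0:1' is often the uncore/GPU subdomain, but the layout
    differs between platforms."""
    with open(f"/sys/class/powercap/{domain}/energy_uj") as f:
        return int(f.read())
```

Sampling energy_uj before and after a benchmark run and dividing by the wall-clock time gives the average domain power for that run.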
Comment 1 Chris Wilson 2019-07-08 16:08:31 UTC
Note that

commit 8ee36e048c98d4015804a23f884be2576f778a93
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jun 20 15:20:52 2019 +0100

    drm/i915/execlists: Minimalistic timeslicing
    
    If we have multiple contexts of equal priority pending execution,
    activate a timer to demote the currently executing context in favour of
    the next in the queue when that timeslice expires. This enforces
    fairness between contexts (so long as they allow preemption -- forced
    preemption, in the future, will kick those who do not obey) and allows
    us to avoid userspace blocking forward progress with e.g. unbounded
    MI_SEMAPHORE_WAIT.

is at least one intentional change in the scheduling that may interfere with throughput tests (like gfxbench) in the presence of third parties (X) also trying to render.

Except that is nowhere near the mentioned commits -- so I will have to check, but if you could keep an eye out for any potential regression from that (such regressions need to be minimised and justified).
Comment 2 Chris Wilson 2019-07-08 16:44:11 UTC
Chris Wilson (17):
      drm/i915: Keep user GGTT alive for a minimum of 250ms
      drm/i915: Kill the undead intel_context.c zombie
      drm/i915: Split GEM object type definition to its own header
      drm/i915: Pull GEM ioctls interface to its own file
      drm/i915: Move object->pages API to i915_gem_object.[ch]
      drm/i915: Move shmem object setup to its own file
      drm/i915: Move phys objects to its own file
      drm/i915: Move mmap and friends to its own file
      drm/i915: Move GEM domain management to its own file
      drm/i915: Move more GEM objects under gem/
      drm/i915: Pull scatterlist utils out of i915_gem.h
      drm/i915: Move GEM object domain management from struct_mutex to local
      drm/i915: Move GEM object waiting to its own file
      drm/i915: Move GEM object busy checking to its own file
      drm/i915: Move GEM client throttling to its own file
      drm/i915: Rename intel_context.active to .inflight
      drm/i915: Drop the deferred active reference

Dave Airlie (2):
      Merge tag 'drm-misc-next-2019-05-24' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
      Merge tag 'drm-intel-next-2019-05-24' of git://anongit.freedesktop.org/drm/drm-intel into drm-next

Jani Nikula (1):
      Merge drm/drm-next into drm-intel-next-queued

Michal Wajdeczko (14):
      drm/i915/guc: Change platform default GuC mode
      drm/i915/guc: Don't allow GuC submission
      drm/i915/guc: Updates for GuC 32.0.3 firmware
      drm/i915/guc: Reset GuC ADS during sanitize
      drm/i915/guc: Always ask GuC to update power domain states
      drm/i915/guc: Define GuC firmware version for Geminilake
      drm/i915/huc: Define HuC firmware version for Geminilake
      drm/i915/guc: New GuC interrupt register for Gen11
      drm/i915/guc: New GuC scratch registers for Gen11
      drm/i915/huc: New HuC status register for Gen11
      drm/i915/guc: Update GuC CTB response definition
      drm/i915/guc: Enable GuC CTB communication on Gen11
      drm/i915/guc: Define GuC firmware version for Icelake
      drm/i915/huc: Define HuC firmware version for Icelake

Oscar Mateo (2):
      drm/i915/guc: Create vfuncs for the GuC interrupts control functions
      drm/i915/guc: Correctly handle GuC interrupts on Gen11

Sam Ravnborg (1):
      Merge remote-tracking branch 'drm-intel/topic/core-for-CI' into drm-tip

Uma Shankar (5):
      drm/i915: Enabled Modeset when HDR Infoframe changes
      drm/i915: Add DRM Infoframe handling for BYT/CHT
      drm/i915: Write HDR infoframe and send to panel
      drm/i915: Add state readout for DRM infoframe
      drm/i915: Attach HDR metadata property to connector

Ville Syrjälä (3):
      drm/i915: Make sandybridge_pcode_read() deal with the second data register
      drm/i915: Make sure we have enough memory bandwidth on ICL
      drm/i915: Enable infoframes on GLK+ for HDR


I would say the biggest potential impact there is:
      drm/i915: Move GEM object domain management from struct_mutex to local
      drm/i915: Drop the deferred active reference

Removing struct_mutex should have been an improvement (reducing lock contention, and at worst not making it any worse), so is the extra burden of active reference tracking to blame?
Comment 3 Eero Tamminen 2019-07-09 12:57:32 UTC
Lakshmi, although you marked this as BXT, there was some impact visible also on KBL GT3e.  I don't think the issue is BXT specific; its visibility just depends on the relative CPU/GPU/bandwidth balance, not absolute performance, and/or how TDP limited the device is.


(In reply to Chris Wilson from comment #1)
>     drm/i915/execlists: Minimalistic timeslicing
...
> is at least one intentional change in the scheduling that may interfere with
> throughput tests (like gfxbench) in the presence of third parties (X) also
> trying to render.

All the impacted tests are complex/heavy ones, and they run fullscreen under Unity, which handles fullscreen properly (unlike Gnome [1]) and disables compositing.  I.e. there's only a single process rendering during these benchmarks.


(In reply to Chris Wilson from comment #2)
> I would say the biggest potential impact there is:
>       drm/i915: Move GEM object domain management from struct_mutex to local
>       drm/i915: Drop the deferred active reference
> 
> Removing struct_mutex should have been an improvement (reducing lock
> contention, and at worse not making it any worse), so the extra burden of
> active reference tracking?

In case it helps, these tests have hundreds of drawcalls per frame and don't have any single bottleneck; the bottleneck can change many times during a frame.  Unlike the other benchmarks I'm running, the most impacted Manhattan tests use UBOs and transform feedback.

What about drm-tip non-i915 code changes from upstream rebases?  The test machines use P-state power management, and GPU benchmarks on TDP-limited devices could be impacted also by non-i915 kernel changes.


[1] https://gitlab.gnome.org/GNOME/mutter/issues/60
Comment 4 Chris Wilson 2019-07-09 13:06:10 UTC
Between e8f06c34fa..8991a80f85, outside of i915/ there were a few header changes (to make them self-contained) and an ALSA merge. Nothing that sets off alarm bells, so I can be reasonably confident that the fault is all mine.
Comment 5 Eero Tamminen 2019-07-09 17:00:50 UTC
Looking at the 1-2% perf regression in Manhattan 3.1 on KBL GT3e, I'm seeing:
* Not much change in GPU & CPU power usage, though CPU power usage did increase and GPU power decrease on KBL too
* Individual frame timings that before the change fluctuated within a few tens of percent now show 3x differences between the fastest and slowest frames [1].  BXT doesn't show a similar change
* IOWait is now 30-60%; before, it was mostly in single digits with only occasional high IOWaits.  I'm not sure whether one can deduce anything from this though, as one earlier run was constantly at 40-70% with no impact on perf

[1] I'll try to reproduce these and look into KBL frame timings a bit more. If there's anything of interest, I'll mail data to you directly.
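
For reference, the fastest/slowest spread mentioned above can be computed from logged per-frame times with a small helper like this (a hypothetical sketch, not part of the actual test tooling):

```python
def frame_stats(frame_times_ms):
    """Summarise per-frame timing spread: min, max, mean, and the
    slowest/fastest ratio used to spot uneven frame pacing."""
    fastest = min(frame_times_ms)
    slowest = max(frame_times_ms)
    return {
        "fastest_ms": fastest,
        "slowest_ms": slowest,
        "mean_ms": sum(frame_times_ms) / len(frame_times_ms),
        "ratio": slowest / fastest,  # ~3.0 would match the regressed runs
    }
```

A ratio creeping from tens of percent up to 3x, with the mean roughly unchanged, points at scheduling or throttling hiccups rather than a uniformly slower GPU.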
Comment 6 Lakshmi 2019-07-10 05:39:29 UTC
(In reply to Eero Tamminen from comment #3)
> Lakshmi, although you marked this as BXT, there was some impact visible also
> on KBL GT3e.  I don't think issue is BXT specific, it's visibility just
> depends on relative CPU/GPU/bandwidth balance, not absolute performance,
> and/or how TDP limited the device is.
Thanks for pointing this out, Eero. For now, I have added KBL as well.
Comment 7 Chris Wilson 2019-07-10 16:43:28 UTC
First look at GLB30_gl_manhattan on bxt gave results within 1 frame (so less than 0.1%) on e8f06c34fa/8991a80f85 for fullscreen and offscreen with a bare Xorg.

That doesn't help me bisect :-p
Comment 8 Eero Tamminen 2019-07-10 17:30:52 UTC
On BXT, the drop is clearly visible (3%) in a daily trend of the median of 3 runs, which is in the middle of a large number of other runs (i.e. definitely TDP limited).

I can try reproducing it manually.

This is the case where it's most visible on J4205 BXT under X / Ubuntu 18.04 / Unity (and X & Mesa git within few months):
* bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_manhattan
* bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_manhattan_off

This is the only case where it's visible on KBL GT3e:
* bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_manhattan31
* bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_manhattan31_off

Note that both devices have a FullHD monitor (i.e. the compositor will skip compositing for 1920x1080 fullscreen windows) and dual-channel memory [1].  Your results can differ significantly with an IGP if you use just a single memory channel, as GPU operations can then be much more memory-bandwidth bound.

The KBL GT3e drop is much smaller, but there's also much less variation, so it's clearly visible in the trend too.

[1] 1866Mhz for BXT, 2133Mhz for KBL.  BXT J4205 BIOS is AMI P1.40 (AsRock motherboard), KBL is BNKBL357.86A.0062.2018.0222.1644.
Comment 9 Chris Wilson 2019-07-11 12:40:53 UTC
e8f06c34fa: median of 30 manhattan runs, 1327.5289306640625 (score)
8991a80f85: median of 30 manhattan runs, 1327.635986328125

using ezbench, gfxbench3, bxt J3455, 1920x1080 fullscreen, bare Xorg

perf doesn't suggest any contention in the kernel, and the workload does not appear to be ratelimited by i915.ko submission overhead; i.e. if it were the locking changes, I'd expect that to be reflected in the perf profile of i915.ko. Hmm.

The system does have a pair of DIMMs, and so I hope dual-channel!
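
For reference, a median-of-runs comparison like the one above can be scripted with a small helper (hypothetical; ezbench does the equivalent internally):

```python
from statistics import median

def relative_change(before_scores, after_scores):
    """Percent change between the median scores of two kernel builds;
    the median tames run-to-run variance in noisy benchmarks."""
    before = median(before_scores)
    after = median(after_scores)
    return (after - before) / before * 100.0
```

With 30 runs per build, a consistent -3% from this helper is well outside the noise the medians absorb; a result within 0.1%, as reported here, genuinely suggests no regression on that particular setup.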
Comment 10 Eero Tamminen 2019-07-11 15:55:13 UTC
(In reply to Chris Wilson from comment #9)
> e8f06c34fa: median of 30 manhattan runs, 1327.5289306640625 (score)
> 8991a80f85: median of 30 manhattan runs, 1327.635986328125

I'm able to reproduce 2-3% Manhattan 3.0 perf drop also manually, when using kernel builds with indicated commits (with yesterday's Mesa & X git versions).


> using ezbench, gfxbench3,

v3?  (Manhattan should work the same in v3, v4 & v5, but it's a difference)


> bxt J3455, 1920x1080 fullscreen, bare Xorg

That's a 12EU BXT with the same TDP as the 18EU J4205 I'm running.  Its GPU/CPU/memory perf balance is different, and it's much less likely to be TDP limited (in case that matters for this).

Do you have an 18EU BXT you could test?


> perf doesn't suggest any contention in the kernel, and does not appear to be
> ratelimited by i915.ko submission overhead, i.e. if it was the locking
> changes I expected that to be reflected in the perf profile of i915.ko. Hmm.

Looking at the BXT ftrace data from successive Manhattan runs, there's something odd.

Manhattan has 3 threads.  The third does nothing, and the main thread does just some messaging during benchmarking:
     0.13%  [kernel.kallsyms]   [k] __fget_light
     0.04%  [kernel.kallsyms]   [k] _copy_from_user
     0.03%  [kernel.kallsyms]   [k] fpregs_assert_state_consistent
     0.01%  libpthread-2.29.so  [.] recvmsg
     0.01%  [kernel.kallsyms]   [k] update_rq_clock
     ...

Except for the first frame (which comes from main thread), all other frames come from the second thread:
     1.14%  i965_dri.so              [.] brw_upload_render_state
     1.11%  i965_dri.so              [.] update_stage_texture_surfaces
     1.04%  i965_dri.so              [.] hash_table_search
     1.02%  i965_dri.so              [.] brw_draw_prims
     0.96%  i965_dri.so              [.] isl_gen9_surf_fill_state_s
     0.89%  [kernel.kallsyms]        [k] i915_gem_madvise_ioctl
     0.85%  testfw_app               [.] 0x00000000003d4b28
     0.80%  i965_dri.so              [.] brw_predraw_resolve_inputs
     0.79%  [kernel.kallsyms]        [k] __entry_text_start
    ...

The odd thing is that:
* according to kernel power::cpu_frequency events, one of the four cores is running at a much higher frequency than the other cores
* the thread doing buffer swaps (I assume it's the one doing the rendering) most of the time isn't running on that core when it does the buffer swap
* because the rendering thread uses by far the most CPU (50%), I assume it has been on the high-speed core (why else would that core run at high freq?)

=> Why would the kernel put the rendering thread on a low-frequency core before it does a buffer swap?  Transform feedback?

(E.g. the Unigine Heaven demo has more threads, but does rendering from the main thread, and the kernel has scheduled that thread on the fastest core when it does the buffer swap.)


Anyway, this doesn't explain the regression, because this behavior is the same before and after the regression.  But it may partly explain why GfxBench tests' behavior differs from other benchmarks on BXT.

(I don't get CPU freq ftrace events from the Core device, so I can't check whether the same happens also on KBL.)
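
For reference, the power::cpu_frequency ftrace events can be reduced to a last-known-frequency-per-core map with a small parser like this (a sketch; it assumes the standard "state=<kHz> cpu_id=<n>" event format):

```python
import re

# Matches ftrace power:cpu_frequency events, e.g.
#   "<idle>-0  [002] d.h. 123.456: cpu_frequency: state=2600000 cpu_id=2"
FREQ_RE = re.compile(r"cpu_frequency:\s+state=(\d+)\s+cpu_id=(\d+)")

def last_freq_per_core(trace_lines):
    """Return the most recent cpu_frequency state (kHz) seen per core;
    a large spread between cores matches the behavior described above."""
    freqs = {}
    for line in trace_lines:
        m = FREQ_RE.search(line)
        if m:
            freqs[int(m.group(2))] = int(m.group(1))
    return freqs
```

Cross-referencing this map with sched_switch events for the rendering thread would show directly whether buffer swaps land on a low-frequency core.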


PS. Regarding the mutex changes during the commit period... could those be related to the more serious i915 deadlock bug 110848?  Both of these bugs are close together time-wise...
Comment 11 Chris Wilson 2019-07-12 09:01:00 UTC
In a lockstats comparison run, the biggest jump in contention was for page scanning and scheduler runqueues -- suggesting that we are causing more mempressure and CPU switches as a result?

The most significant difference in i915.ko perf is from
 drm/i915: Move GEM object domain management from struct_mutex to local
adding the ww_acquire_context, and an extra batch of work in the many ww_mutexes we acquire for serialising the fences during submission.

I still don't think there's anything alarming in the extra kernel overhead there; we are not letting the GPU stall while waiting for the next batch to be submitted (at least on this baby bxt), so I think it's the mempressure/scheduler angle.
Comment 12 Eero Tamminen 2019-07-12 10:23:06 UTC
When going through all the benchmarks data, I noticed that GLBenchmark 2.7 windowed *onscreen* Egypt and T-Rex also regress on J4205 BXT in the same time interval:
$ build_x86_64/binaries/GLBenchmark -skip_load_frames -data data -t GLB27_EgyptHD_inherited_C24Z16_FixedTime -w 1366 -h 768 -ow 1366 -oh 768
$ build_x86_64/binaries/GLBenchmark -skip_load_frames -data data -t GLB27_TRex_C24Z16_FixedTimeStep_Offscreen -w 1366 -h 768 -ow 1366 -oh 768

The regression is 4-6%, but one sees it clearly only from the trend, because the GLB tests have a lot of variance (due to being also slightly CPU bound, and their FPS being reported as integers).

This regression isn't visible on the other machines (BDW GT2, SKL GT2, KBL GT3e).
Comment 13 Chris Wilson 2019-08-02 23:15:26 UTC
Wrt the jump in shrinker activity, I hope 

commit 1aff1903d0ff53f055088a77948ac8d8224d42db
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Aug 2 22:21:36 2019 +0100

    drm/i915: Hide unshrinkable context objects from the shrinker
    
    The shrinker cannot touch objects used by the contexts (logical state
    and ring). Currently we mark those as "pin_global" to let the shrinker
    skip over them, however, if we remove them from the shrinker lists
    entirely, we don't even have to include them in our shrink accounting.
    
    By keeping the unshrinkable objects in our shrinker tracking, we report
    a large number of objects available to be shrunk, and leave the shrinker
    deeply unsatisfied when we fail to reclaim those. The shrinker will
    persist in trying to reclaim the unavailable objects, forcing the system
    into a livelock (not even hitting the dread oomkiller).
    
    v2: Extend unshrinkable protection for perma-pinned scratch and guc
    allocations (Tvrtko)
    v3: Notice that we should be pinned when marking unshrinkable and so the
    link cannot be empty; merge duplicate paths.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Matthew Auld <matthew.auld@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190802212137.22207-1-chris@chris-wilson.co.uk

helps (by making the shrinker less aggressive towards us)
Comment 14 Eero Tamminen 2019-08-13 12:58:47 UTC
Between following drm-tip kernel commits:
8bb86058e9 at 2019-08-09 15:06:19: 2019y-08m-09d-15h-05m-28s UTC integration manifest
cfacb6e14c at 2019-08-10 14:01:32: 2019y-08m-10d-14h-00m-41s UTC integration manifest

There were following perf changes on SKL GT2 & KBL GT3e:
* SynMark VSTangent improved by 10%
* GLB 2.7 Egypt improved by 1-2%
* GLB 2.7 T-Rex improved by < 1%
* SynMark ZBuffer decreased by 5% on KBL GT3e and 3% on SKL GT2 (*)

(SynMark is FullHD fullscreen, GLB is 1/2 screen windowed.)

Neither BDW GT2 nor BXT perf changed.  For other things I don't have data.

(GfxBench Manhattan tests perf has improved on BXT sometime since mid-July, but I don't have data on whether it's due to kernel or Mesa.)


(*) The ZBuffer test draws a lot of trivial triangles on top of each other, so it tests depth read/write and render write bandwidth + draw overhead.
Comment 15 Chris Wilson 2019-08-13 17:26:48 UTC
(In reply to Eero Tamminen from comment #14)
> Between following drm-tip kernel commits:
> 8bb86058e9 at 2019-08-09 15:06:19: 2019y-08m-09d-15h-05m-28s UTC integration
> manifest
> cfacb6e14c at 2019-08-10 14:01:32: 2019y-08m-10d-14h-00m-41s UTC integration
> manifest
> 
> There were following perf changes on SKL GT2 & KBL GT3e:
> * SynMark VSTangent improved by 10%
> * GLB 2.7 Egypt improved by 1-2%
> * GLB 2.7 T-Rex improved by < 1%
> * SynMark ZBuffer decreased by 5% on KBL GT3e and 3% on SKL GT2 (*)

Principal change there:
      drm/i915/gtt: enable GTT cache by default
      drm/i915/gtt: disable 2M pages for pre-gen11

Other noteworthy (but unlikely to impact):
      dma-buf: make dma_fence structure a bit smaller v2
      dma-buf: add reservation_object_fences helper
      drm/i915: use new reservation_object_fences helper
      dma-buf: further relax reservation_object_add_shared_fence
Comment 16 Eero Tamminen 2019-08-15 12:31:55 UTC
(In reply to Eero Tamminen from comment #14)
> There were following perf changes on SKL GT2 & KBL GT3e:
> * SynMark VSTangent improved by 10%
> * GLB 2.7 Egypt improved by 1-2%
> * GLB 2.7 T-Rex improved by < 1%
> * SynMark ZBuffer decreased by 5% on KBL GT3e and 3% on SKL GT2 (*)

On BXT, ZBuffer perf had actually also increased (from what it was a month ago, as I don't have closer "before" data), and on BDW GT2 there was no change in it (although the other listed cases did improve also on BDW GT2).

I.e. I think the ZBuffer drop on SKL / KBL can be ignored.  It's been the only significant kernel perf change since this bug was filed, besides the one around the 22nd of June, with a 5% improvement in SynMark Batch0, 1-4% in VS*, and a 4% perf drop in SynMark GSCloth & ShMapPcf on SKL GT2 (and marginal changes on other Core devices).

