Bug 107410 - [SKL] Up to 20% performance regression in GpuTest Triangle, due to 2x higher CPU power usage
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2018-07-27 14:28 UTC by Eero Tamminen
Modified: 2018-08-23 10:03 UTC

See Also:
i915 platform: SKL
i915 features:


Attachments
Try these steps (507 bytes, patch)
2018-07-28 18:20 UTC, Srinivas Pandruvada

Description Eero Tamminen 2018-07-27 14:28:30 UTC
Between the following drm-tip commits:
2018-06-16 13:01:59 UTC d4b21cf9ff "2018y-06m-16d-13h-00m-25s UTC integration manifest"
2018-06-17 12:43:31 UTC e47233f783 "2018y-06m-17d-12h-42m-13s UTC integration manifest"

There were the following 3D benchmark performance regressions on SKL GT4e:
- 20% GpuTest Triangle (windowed & composited, half screen size)
- 10% GpuTest Triangle (FullHD fullscreen i.e. slower)
- 1-6% SynMark Batch[2-4]
- 1-4% GfxBench Tessellation (onscreen only), GpuTest Julia32 (windowed), SynMark GSCloth & ShMapPcf (fullscreen)
  (There may also be a smaller regression in SynMark CSCloth & ShMapVsm)

SKL GT2 shows regressions in the same cases, but they're somewhat smaller.

There were no improvements in any 3D benchmarks from this, but I noticed a small increase in SIMD CPU copy performance and a large increase in SIMD CPU read performance.  However, that was only on SKL GT2, not GT4e, so it may be unrelated.


When looking at the RAPL & CAGF data from before and after the regression on SKL GT4e for the GpuTest fullscreen Triangle case:
1. *CPU power usage doubled*
2. GPU power usage decreased a bit
3. GPU frequency drops from (max) 950 to 800 MHz, but temperature is OK
=> the kernel change has turned this use-case from GPU bound to CPU/TDP bound

SKL GT2 isn't TDP limited, and there one sees only effect 1), besides the FPS drop, i.e. there the use-case became more CPU bound, but not enough to drop GPU freq.
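
As a rough illustration of how a CPU power comparison like the one above can be made from RAPL energy counters (this is not the tooling used for this report): average power is the difference between two energy readings divided by the elapsed time. The sketch below assumes the Linux powercap interface under /sys/class/powercap/intel-rapl:0 (files energy_uj and max_energy_range_uj), which may differ per system; the function names are made up for illustration.

```python
import time
from pathlib import Path

# Assumed powercap location of the package-0 RAPL domain; may differ per system.
RAPL_DIR = Path("/sys/class/powercap/intel-rapl:0")


def average_watts(start_uj, end_uj, seconds, max_range_uj):
    """Average power between two energy readings, allowing for one counter wrap."""
    delta_uj = end_uj - start_uj
    if delta_uj < 0:  # the energy counter wrapped around once
        delta_uj += max_range_uj
    return delta_uj / 1e6 / seconds


def sample_package_power(seconds=5.0):
    """Measure average package power over `seconds` (reading usually needs root)."""
    max_range = int((RAPL_DIR / "max_energy_range_uj").read_text())
    start = int((RAPL_DIR / "energy_uj").read_text())
    time.sleep(seconds)
    end = int((RAPL_DIR / "energy_uj").read_text())
    return average_watts(start, end, seconds, max_range)
```

Running sample_package_power() while the benchmark is active, once on each kernel, gives two numbers that can be compared directly.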


Interestingly, these regressions are visible only on:
- SKL GT2  (i5-6600K)
- SKL GT4e (i7-6770HQ "SkullCanyon")

*not* on:
- KBL GT2  (i5-7500U)
- KBL GT3e (i7-7567U)
- SKL GT3e (i5-6260U)
- CFL-S (pre-production)

On those devices there were no changes in CPU power usage (nor performance).
Comment 1 Chris Wilson 2018-07-27 14:59:38 UTC
That window pulled in 4.18-rc1. It is likely to be the changes around setting cpu frequencies around iowaits.
Comment 2 Chris Wilson 2018-07-27 15:01:14 UTC
Of particular interest, I guess,

commit d09fcecb0c797b884ce65daa37c121a2786bb17b
Merge: f5b7769eb040 6a900f884e3e
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Wed Jun 13 07:24:18 2018 -0700

    Merge tag 'pm-4.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
    
    Pull more power management updates from Rafael Wysocki:
     "These revert a recent PM core change that introduced a regression, fix
      the build when the recently added Kryo cpufreq driver is selected, add
      support for devices attached to multiple power domains to the generic
      power domains (genpd) framework, add support for iowait boosting on
      systens with hardware-managed P-states (HWP) enabled to the
      intel_pstate driver, modify the behavior of the wakeup_count device
      attribute in sysfs, fix a few issues and clean up some ugliness,
      mostly in cpufreq (core and drivers) and in the cpupower utility.
    
      Specifics:
    
       - Revert a recent PM core change that attempted to fix an issue
         related to device links, but introduced a regression (Rafael
         Wysocki)
    
       - Fix build when the recently added cpufreq driver for Kryo
         processors is selected by making it possible to build that driver
         as a module (Arnd Bergmann)
    
       - Fix the long idle detection mechanism in the out-of-band (ondemand
         and conservative) cpufreq governors (Chen Yu)
    
       - Add support for devices in multiple power domains to the generic
         power domains (genpd) framework (Ulf Hansson)
    
       - Add support for iowait boosting on systems with hardware-managed
         P-states (HWP) enabled to the intel_pstate driver and make it use
         that feature on systems with Skylake Xeon processors as it is
         reported to improve performance significantly on those systems
         (Srinivas Pandruvada)
    
       - Fix and update the acpi_cpufreq, ti-cpufreq and imx6q cpufreq
         drivers (Colin Ian King, Suman Anna, Sébastien Szymanski)
    
       - Change the behavior of the wakeup_count device attribute in sysfs
         to expose the number of events when the device might have aborted
         system suspend in progress (Ravi Chandra Sadineni)
    
       - Fix two minor issues in the cpupower utility (Abhishek Goel, Colin
         Ian King)"
    
    * tag 'pm-4.18-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
      Revert "PM / runtime: Fixup reference counting of device link suppliers at probe"
      cpufreq: imx6q: check speed grades for i.MX6ULL
      cpufreq: governors: Fix long idle detection logic in load calculation
      cpufreq: intel_pstate: enable boost for Skylake Xeon
      PM / wakeup: Export wakeup_count instead of event_count via sysfs
      PM / Domains: Add dev_pm_domain_attach_by_id() to manage multi PM domains
      PM / Domains: Add support for multi PM domains per device to genpd
      PM / Domains: Split genpd_dev_pm_attach()
      PM / Domains: Don't attach devices in genpd with multi PM domains
      PM / Domains: dt: Allow power-domain property to be a list of specifiers
      cpufreq: intel_pstate: New sysfs entry to control HWP boost
      cpufreq: intel_pstate: HWP boost performance on IO wakeup
      cpufreq: intel_pstate: Add HWP boost utility and sched util hooks
      cpufreq: ti-cpufreq: Use devres managed API in probe()
      cpufreq: ti-cpufreq: Fix an incorrect error return value
      cpufreq: ACPI: make function acpi_cpufreq_fast_switch() static
      cpufreq: kryo: allow building as a loadable module
      cpupower : Fix header name to read idle state name
      cpupower: fix spelling mistake: "logilename" -> "logfilename"
Comment 3 Eero Tamminen 2018-07-27 15:11:29 UTC
All of these (Ubuntu 16.04) machines use the P-state driver with the "powersave" governor, so the P-state iowait boost indeed sounds like a good candidate.

The issue happening only on specific versions of SKL might help narrow the change down further, but they aren't the Xeon processors mentioned in the above commit description.
Comment 4 Chris Wilson 2018-07-27 15:23:46 UTC
+static const struct x86_cpu_id intel_pstate_hwp_boost_ids[] = {
+       ICPU(INTEL_FAM6_SKYLAKE_X, core_funcs),
+       ICPU(INTEL_FAM6_SKYLAKE_DESKTOP, core_funcs),
+       {}
+};

That's all but mobile parts.
Comment 5 Chris Wilson 2018-07-27 15:26:21 UTC
Try echo 0 > /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
Comment 6 Srinivas Pandruvada 2018-07-28 18:20:38 UTC
Created attachment 140870
Try these steps

1.
First check if you have
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost == 1
Unfortunately some Xeon E3 reused DESKTOP CPU ID.

2.
If yes, as Chris suggested, try
#echo 0 >  /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost

If this fixes the regression, try the attached patch. With the patch applied,
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost should be 0 if your platform is not an E3.

If you still see the regression, attach the output of
#acpidump > acpi.out
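
The check in step 1 can also be scripted. A minimal sketch, assuming the sysfs path quoted above; the helper name and its optional path parameter are made up for illustration.

```python
from pathlib import Path

# sysfs attribute named in the steps above
HWP_BOOST = Path("/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost")


def hwp_boost_state(path=HWP_BOOST):
    """Return True if boost is on, False if off, None if the attribute cannot be read."""
    try:
        return Path(path).read_text().strip() == "1"
    except OSError:
        return None


if __name__ == "__main__":
    labels = {True: "enabled", False: "disabled", None: "not available"}
    print("HWP dynamic boost:", labels[hwp_boost_state()])
```

A result of None simply means the file could not be read, for example because the kernel does not expose the attribute.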
Comment 7 Srinivas Pandruvada 2018-07-29 00:47:11 UTC
Also please attach the output of
#cat /proc/cpuinfo

Could you also test on Clear Linux, which always runs in performance mode?
Comment 8 Eero Tamminen 2018-07-30 13:39:58 UTC
(In reply to Srinivas Pandruvada from comment #6)
> First check if you have
> /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost == 1
> Unfortunately some Xeon E3 reused DESKTOP CPU ID.

It's "1" on the regressing SKL machines, and "0" on the "U" one that didn't regress.


> If yes, as Chris suggested, try
> #echo 0 >  /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost
> 
> If this fixes

At least the largest GpuTest Triangle regression disappears.


> try the attached patch.

I'm running the full set of benchmarks with this, so it takes a while.  I'll report the results tomorrow.


> After this patch should be 
> /sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost == 0 if your platform
> is not a E3.

With the patch it's "0".


(In reply to Srinivas Pandruvada from comment #7)
> Also do test on clear Linux, which always runs in performance mode?

Why?  I assume this is not just to test performance governor [1]...


1. We're tracking upstream Git versions, not older distro versions of SW, so we build our own git versions of the (drm-tip) kernel and the rest of the 3D stack, and use those also on Clear Linux (currently we use Clear only on BXT).

2. You need to define what you mean by "Clear Linux".

It updates frequently and at least I haven't yet found a way to determine what its packages have actually been built from/with.  Its package management doesn't seem to provide the basic information on what was used to build a package:
- which exact sources
- what patches
- which compile options


[1] We switched to the powersave governor a while ago because it gives slightly better performance in (3D) benchmarks than the Clear Linux default "performance" governor, and it didn't regress in any of our (3D) benchmarks. Besides, with Francisco's P-state improvements applied on top of it, (3D / IGP) perf improves very noticeably (only) with powersave.
Comment 9 Srinivas Pandruvada 2018-07-30 13:51:44 UTC
I tested Unigine Heaven on an SKL 6600 gaming system with liquid cooling. I didn't see any impact with the change.
I will try GpuTest and check today.
Comment 10 Eero Tamminen 2018-07-30 14:13:30 UTC
Heaven wasn't impacted in my tests either.   Only simpler tests were.


(In reply to Eero Tamminen from comment #0)
> There were following 3D benchmark performance regressions on SKL GT4e:
> - 20% GpuTest Triangle (windowed & composited, half screen size)

./GpuTest /test=triangle /width=1366 /height=768 /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000

(Composited with Unity desktop compositor as it's windowed.)


> - 10% GpuTest Triangle (FullHD fullscreen i.e. slower)

./GpuTest /test=triangle /width=1920 /height=1080 /fullscreen /msaa=1 /no_scorebox /benchmark /benchmark_duration_ms=35000

The test does just a clear and draws a single triangle covering half the window, so it is as trivial as glxgears.


GpuTest is the benchmark from here:
https://www.geeks3d.com/gputest/

(Score it reports is just the number of frames rendered.)



> - 1-6% SynMark Batch[2-4]

Batch0-7 are a series of tests where about 2M triangles are drawn.  In Batch0 they're drawn with a single draw call, in Batch1 they're split into two calls, in Batch2 into 4 calls, etc.  At or after Batch5, i.e. 2M triangles split into 32 draw calls, the tests start to become CPU bound.
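
The split described above is a simple doubling. A small sketch of the arithmetic, assuming exactly 2,000,000 triangles (the real count is only "about 2M") and that the doubling continues through Batch7; the names are made up for illustration.

```python
TOTAL_TRIANGLES = 2_000_000  # assumed round figure, the tests draw "about 2M"


def draw_calls(batch_index):
    """Number of draw calls in BatchN: 1, 2, 4, ... doubling per test."""
    return 2 ** batch_index


def triangles_per_call(batch_index):
    """Triangles submitted per draw call in BatchN."""
    return TOTAL_TRIANGLES // draw_calls(batch_index)


for n in range(8):
    print(f"Batch{n}: {draw_calls(n):3d} draw calls, "
          f"{triangles_per_call(n):7d} triangles per call")
```

With these assumptions Batch5 comes out at 32 draw calls of 62500 triangles each, which matches the point where the tests start to become CPU bound.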


> - 1-4% GfxBench Tessellation (onscreen only), GpuTest Julia32 (windowed), SynMark GSCloth & ShMapPcf (fullscreen)

GSCloth does 72 draw calls per frame, the Tessellation test 50, ShMapPcf 7, and Julia32 2.  Julia32 is the fastest of these and GSCloth the slowest (GSCloth may be slightly CPU bound).
Comment 11 Eero Tamminen 2018-07-31 11:29:09 UTC
(In reply to Eero Tamminen from comment #8)
> (In reply to Srinivas Pandruvada from comment #6)
> > If this fixes, try the attached patch.
> 
> I'm running the full set of benchmarks with this, so it takes a while.
> I'll report  results tomorrow.

Yes, the attached patch fixed all the regressions.


(In reply to Eero Tamminen from comment #0)
> There were no improvements in any 3D benchmarks from this, but I noticed
> small increase in SIMD CPU copy and large increase in SIMD CPU read
> performance.  However, that was only on SKL GT2, not GT4e, so it may be
> unrelated.

The fix patch regressed Unigine Valley slightly.  On closer look, Valley performance had actually increased with the original change (by 1.0-1.5%), but only on SKL GT2.  I.e. there was actually one platform with one 3D benchmark where the initial change improved perf slightly (without any noticeable increase in CPU power usage).

-> I assume the 25% improvement in SKL GT2 CPU read performance was also due to the original change, but that improvement is very much a corner case [1].

[1] SKL i5-6600K has 4 real cores.  CPU read performance (with a 64MB block) improved with the original change *only* in the following cases:
- SSE2   with 4 threads
- SSE4.1 with 6 threads
- AVX1/2 with 6 threads

It didn't improve when the number of running SIMD threads was smaller or larger than those.
Comment 12 Chris Wilson 2018-07-31 22:31:15 UTC
commit 01e61a42a5d345a4c0205889498f0c9a0fb9ee8c
Author: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Date:   Mon Jul 30 15:00:29 2018 -0700

    cpufreq: intel_pstate: Limit the scope of HWP dynamic boost platforms
    
    Dynamic boosting of HWP performance on IO wake showed significant
    improvement to IO workloads. This series was intended for Skylake Xeon
    platforms only and feature was enabled by default based on CPU model
    number.
    
    But some Xeon platforms reused the Skylake desktop CPU model number. This
    caused some undesirable side effects to some graphics workloads. Since
    they are heavily IO bound, the increase in CPU performance decreased the
    power available for GPU to do its computing and hence decrease in graphics
    benchmark performance.
    
    For example on a Skylake desktop, GpuTest benchmark showed average FPS
    reduction from 529 to 506.
    
    This change makes sure that HWP boost feature is only enabled for Skylake
    server platforms by using ACPI FADT preferred PM Profile. If some desktop
    users wants to get benefit of boost, they can still enable boost from
    intel_pstate sysfs attribute "hwp_dynamic_boost".
    
    Fixes: 41ab43c9c89e (cpufreq: intel_pstate: enable boost for Skylake Xeon)
    Link: https://bugs.freedesktop.org/show_bug.cgi?id=107410
    Reported-by: Eero Tamminen <eero.t.tamminen@intel.com>
    Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
    Reviewed-by: Francisco Jerez <currojerez@riseup.net>
    Acked-by: Mel Gorman <mgorman@techsingularity.net>
    Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
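
For scale, the FPS figures quoted in the commit message above can be turned into a percentage. A minimal sketch; the helper name is made up for illustration.

```python
def percent_drop(before, after):
    """Relative decrease from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


# FPS figures from the commit message: 529 before, 506 with boost enabled
print(f"{percent_drop(529, 506):.1f}%")  # prints 4.3%
```

That is a smaller drop than the up to 20% reported in the description for SKL GT4e; the commit message only says its figures are from "a Skylake desktop".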
Comment 13 Eero Tamminen 2018-08-10 11:11:07 UTC
Verified, the large regressions are gone.

(And the small Unigine Valley perf improvement also disappeared as expected.)
Comment 14 Lakshmi 2018-08-23 10:03:17 UTC
Closed as it is verified.

