111731 – [GEN9+] large perf drop (up to 1/3) in most 3D benchmarks from force-enabling IOMMU

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	x86-64 (AMD64) All

Importance:	high major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged, ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2019-09-18 11:41 UTC by Eero Tamminen
Modified:	2019-11-29 19:32 UTC
CC List:	3 users

See Also:
i915 platform:	BXT, ICL, KBL, SKL
i915 features:	Perf/PMU

Attachments
attachment-21608-0.html (1.58 KB, text/html) 2019-09-19 13:26 UTC, Jon Ewins

Description Eero Tamminen 2019-09-18 11:41:49 UTC

Setup:
* HW: SKL i7-6770HQ
* OS: Ubuntu 18.04 desktop
* SW stack: git versions of drm-tip kernel, X and Mesa

Between the following drm-tip kernel 5.3-rc8 commits:
* 2019-09-10_13-35-40 32c81a317364: drm-tip: 2019y-09m-10d-13h-34m-53s UTC integration manifest
* 2019-09-11_15-07-28 b27acd37b7de: drm-tip: 2019y-09m-11d-15h-06m-37s UTC integration manifest

Kernel performance dropped in most 3D benchmarks.  Worst cases were:
* 27% SynMark CSDof (fullscreen)
* 20-25% GpuTest Triangle (1/2 screen window), SynMark VSTangent
* 20% Unigine Heaven, GpuTest Triangle (fullscreen), SynMark DeferredAA

With a few-months-old git version of Mesa & X server, the perf drop in SynMark Fill* tests was also ~20%.

There also seems to be a few percent improvement in SynMark TexMem* tests at the same time, but that's visible only with specific Mesa driver (i965 or Iris) and X server versions.  For tests other than the memory bandwidth ones, the performance change isn't significantly affected by the Mesa/Xorg version, only by the kernel.

This drop is specific to SkullCanyon; it's not visible on other platforms (KBL GT3e, SKL/BDW GT2, BXT).  While 3D benchmarks are impacted most, there also seems to be a marginal perf drop in Media (transcode) tests.

Although this impacts only SkullCanyon, I'm setting severity to major because the perf drop is so large.

Comment 1 Chris Wilson 2019-09-18 12:06:06 UTC

drm/i915: Make i915_vma.flags atomic_t for mutex reduction
drm/i915: Make shrink/unshrink be atomic

are meh.

drm/i915: Whitelist COMMON_SLICE_CHICKEN2

is a possibility, but my money is on

drm/i915: Force compilation with intel-iommu for CI validation

A run with intel_iommu=off should test that theory, or intel_iommu=igfx_off and reverting HAX iommu/intel: Ignore igfx_off

Comment 2 Eero Tamminen 2019-09-18 12:55:08 UTC

(In reply to Chris Wilson from comment #1)
> drm/i915: Make i915_vma.flags atomic_t for mutex reduction
> drm/i915: Make shrink/unshrink be atomic
> 
> are meh.
> 
> drm/i915: Whitelist COMMON_SLICE_CHICKEN2
> 
> is a possibility, but my money is on
> 
> drm/i915: Force compilation with intel-iommu for CI validation
> 
> A run with intel_iommu=off should test that theory, or intel_iommu=igfx_off
> and reverting HAX iommu/intel: Ignore igfx_off

We currently run all tests with the "intel_iommu=igfx_off" kernel command line option, and while the author date in the above intel-iommu/igfx_off commits is within range, their drm-tip repo commit dates are actually from Monday this week, not from a week ago?

(Why would the IOMMU perf impact be SKL GT4e specific?)

Also, whereas latest drm-tip kernel shows:
$ sudo grep mmu /sys/kernel/debug/dri/0/i915_capabilities
iommu: enabled

There's no such output for the 2019-09-11 "b27acd37b7de" kernel where this regression was noticed.
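
A minimal sketch of that check, assuming the debugfs file has already been read into a string (reading it needs root). The helper name `iommu_state` is hypothetical; only the `iommu: enabled` line is taken from the output above. A missing line is returned as None rather than as "disabled", because an absent line by itself says nothing about the IOMMU state.

```python
def iommu_state(capabilities_text):
    """Return the value of the 'iommu:' line, or None if there is no such line."""
    for line in capabilities_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "iommu":
            return value.strip()
    return None


# Text as printed by the latest drm-tip kernel (from the grep output above).
new_kernel_text = "iommu: enabled\n"
# Kernels without this line: nothing to parse.
old_kernel_text = ""

print(iommu_state(new_kernel_text))  # enabled
print(iommu_state(old_kernel_text))  # None
```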


There is a difference on kernel IOMMU outputs between these commits though...

Before:
[    0.625811] DMAR: No ATSR found
[    0.625842] DMAR: dmar1: Using Queued invalidation
[    0.625948] pci 0000:00:00.0: Adding to iommu group 0
[    0.625993] pci 0000:00:08.0: Adding to iommu group 1
[    0.626054] pci 0000:00:14.0: Adding to iommu group 2
[    0.626064] pci 0000:00:14.2: Adding to iommu group 2
[    0.626110] pci 0000:00:16.0: Adding to iommu group 3
[    0.626159] pci 0000:00:1c.0: Adding to iommu group 4
[    0.626203] pci 0000:00:1c.1: Adding to iommu group 5
[    0.626249] pci 0000:00:1c.4: Adding to iommu group 6
[    0.626296] pci 0000:00:1d.0: Adding to iommu group 7
[    0.626348] pci 0000:00:1f.0: Adding to iommu group 8
[    0.626359] pci 0000:00:1f.2: Adding to iommu group 8
[    0.626368] pci 0000:00:1f.3: Adding to iommu group 8
[    0.626378] pci 0000:00:1f.4: Adding to iommu group 8
[    0.626422] pci 0000:00:1f.6: Adding to iommu group 9
[    0.626467] pci 0000:02:00.0: Adding to iommu group 10
[    0.626518] pci 0000:3c:00.0: Adding to iommu group 11
[    0.626523] DMAR: Intel(R) Virtualization Technology for Directed I/O

After:
[    0.625808] DMAR: No ATSR found
[    0.625837] DMAR: dmar0: Using Queued invalidation
[    0.625841] DMAR: dmar1: Using Queued invalidation
[    0.626033] pci 0000:00:00.0: Adding to iommu group 0
[    0.632522] pci 0000:00:02.0: Adding to iommu group 1
[    0.632568] pci 0000:00:08.0: Adding to iommu group 2
[    0.632634] pci 0000:00:14.0: Adding to iommu group 3
[    0.632644] pci 0000:00:14.2: Adding to iommu group 3
[    0.632684] pci 0000:00:16.0: Adding to iommu group 4
[    0.632746] pci 0000:00:1c.0: Adding to iommu group 5
[    0.632797] pci 0000:00:1c.1: Adding to iommu group 6
[    0.632854] pci 0000:00:1c.4: Adding to iommu group 7
[    0.632911] pci 0000:00:1d.0: Adding to iommu group 8
[    0.632966] pci 0000:00:1f.0: Adding to iommu group 9
[    0.632977] pci 0000:00:1f.2: Adding to iommu group 9
[    0.632988] pci 0000:00:1f.3: Adding to iommu group 9
[    0.632998] pci 0000:00:1f.4: Adding to iommu group 9
[    0.633039] pci 0000:00:1f.6: Adding to iommu group 10
[    0.633096] pci 0000:02:00.0: Adding to iommu group 11
[    0.633146] pci 0000:3c:00.0: Adding to iommu group 12
[    0.633233] DMAR: Intel(R) Virtualization Technology for Directed I/O
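
The two logs can also be compared mechanically. The following is a rough sketch, assuming the dmesg text has been saved for both kernels; the function names are hypothetical and the sample input is a shortened excerpt of the lines above. It lists the PCI devices that were put into an IOMMU group only in the second log.

```python
import re

# Matches lines such as: "pci 0000:00:02.0: Adding to iommu group 1"
GROUP_RE = re.compile(r"pci (\S+): Adding to iommu group (\d+)")


def iommu_devices(dmesg_text):
    """Map PCI address -> IOMMU group number for every matching dmesg line."""
    return {m.group(1): int(m.group(2)) for m in GROUP_RE.finditer(dmesg_text)}


def newly_grouped(before_text, after_text):
    """PCI addresses that appear in an IOMMU group only in the 'after' log."""
    before = iommu_devices(before_text)
    after = iommu_devices(after_text)
    return sorted(set(after) - set(before))


before = """\
[    0.625948] pci 0000:00:00.0: Adding to iommu group 0
[    0.625993] pci 0000:00:08.0: Adding to iommu group 1
"""
after = """\
[    0.626033] pci 0000:00:00.0: Adding to iommu group 0
[    0.632522] pci 0000:00:02.0: Adding to iommu group 1
[    0.632568] pci 0000:00:08.0: Adding to iommu group 2
"""

print(newly_grouped(before, after))  # ['0000:00:02.0']
```

The group numbers shift by one for every later device, so comparing the device sets is more useful than a plain line diff.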

Other IOMMU / DMAR related dmesg output is identical between the commits:
[    0.002741] ACPI: DMAR 0x000000007A545CD8 0000A8 (v01 INTEL  NUC6i7KY 00000001 INTL 00000001)
...
[    0.125612] Kernel command line: BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait fsck.repair=yes intel_iommu=igfx_off ro
...
[    0.174960] DMAR: Host address width 39
[    0.174962] DMAR: DRHD base: 0x000000fed90000 flags: 0x0
[    0.174967] DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 1c0000c40660462 ecap 7e3ff0505e
[    0.174970] DMAR: DRHD base: 0x000000fed91000 flags: 0x1
[    0.174975] DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
[    0.174978] DMAR: RMRR base: 0x0000007a275000 end: 0x0000007a294fff
[    0.174980] DMAR: RMRR base: 0x0000007b800000 end: 0x0000007fffffff
[    0.174983] DMAR-IR: IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 1
[    0.174985] DMAR-IR: HPET id 0 under DRHD base 0xfed91000
[    0.174987] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[    0.176506] DMAR-IR: Enabled IRQ remapping in x2apic mode
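
To confirm which `intel_iommu=` values a kernel was booted with, the command line can be parsed directly. This is only a sketch with a hypothetical helper name; it uses the command line from the dmesg output above as sample input. It shows what was requested at boot, not what the kernel honored: a kernel carrying the "Ignore igfx_off" patch would still report `igfx_off` here.

```python
def intel_iommu_options(cmdline):
    """Collect all comma-separated values passed via intel_iommu=... tokens."""
    options = []
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name == "intel_iommu":
            options.extend(v for v in value.split(",") if v)
    return options


# Command line copied from the dmesg output above.
cmdline = (
    "BOOT_IMAGE=/boot/drm_intel root=/dev/nvme0n1p2 rootwait "
    "fsck.repair=yes intel_iommu=igfx_off ro"
)

print(intel_iommu_options(cmdline))  # ['igfx_off']
```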


PS. This regression is large enough that one run of CSDof is enough to see whether kernel version is impacted:
 ./synmark2 OglCSDof

Comment 3 Chris Wilson 2019-09-18 13:04:31 UTC

(In reply to Eero Tamminen from comment #2)
> (In reply to Chris Wilson from comment #1)
> > drm/i915: Make i915_vma.flags atomic_t for mutex reduction
> > drm/i915: Make shrink/unshrink be atomic
> > 
> > are meh.
> > 
> > drm/i915: Whitelist COMMON_SLICE_CHICKEN2
> > 
> > is a possibility, but my money is on
> > 
> > drm/i915: Force compilation with intel-iommu for CI validation
> > 
> > A run with intel_iommu=off should test that theory, or intel_iommu=igfx_off
> > and reverting HAX iommu/intel: Ignore igfx_off
> 
> We currently run all tests with the "intel_iommu=igfx_off" kernel command
> line option, and while the author date in the above intel-iommu/igfx_off
> commits is within range, their drm-tip repo commit dates are actually from
> Monday this week, not from a week ago?

The commit is in core-for-CI which is a rebasing tree; on Monday it was rebased to v5.3 so that we could drop some patches. So the commit id will be updated fairly often while it remains in that branch.

> (Why would the IOMMU perf impact be SKL GT4e specific?)

My guess at this moment would be that eDRAM feels the hit more significantly. Or that we've just got the caching completely wrong on that sku.

> Also, whereas latest drm-tip kernel shows:
> $ sudo grep mmu /sys/kernel/debug/dri/0/i915_capabilities
> iommu: enabled
> 
> There's no such output for the 2019-09-11 "b27acd37b7de" kernel where this
> regression was noticed.

That is a new feature added so that we could easily determine which machines in the farm have iommu enabled.

> There is a difference on kernel IOMMU outputs between these commits though...
>
[snip]
> After:
> [    0.632522] pci 0000:00:02.0: Adding to iommu group 1

So we definitely enabled iommu on igfx in this range.

> PS. This regression is large enough that one run of CSDof is enough to see
> whether kernel version is impacted:
>  ./synmark2 OglCSDof

20+% regression is also in line with some kbl (gt3e iirc) media runs I did.

Comment 4 Eero Tamminen 2019-09-18 14:04:31 UTC

(In reply to Chris Wilson from comment #3)
> > (Why would the IOMMU perf impact be SKL GT4e specific?)
> 
> My guess at this moment would be that eDRAM feels the hit more
> significantly. Or that we've just got the caching completely wrong on that
> sku.

We were supposed to have VT-d disabled in the BIOS on all our machines, but apparently it had been enabled while the SkullCanyon was in other use for a while. I.e. it was the only machine with VT-d enabled.


> 20+% regression is also in line with some kbl (gt3e iirc) media runs I did.

Media, not 3D?  (That's more than I saw on SkullCanyon in media test-cases.)

I've now enabled VT-d on a few other machines (BDW GT2, BXT, SKL GT2, KBL GT3e) to get you a bit more perf info.  I'll add that info here later this week.

Comment 5 Jon Ewins 2019-09-19 13:26:46 UTC

Created attachment 145433 [details]
attachment-21608-0.html

I am currently away and will respond when I return

Comment 6 Eero Tamminen 2019-09-19 15:58:30 UTC

The largest performance drops from IOMMU with a recent user-space 3D / Media stack are the following (all machines have dual-channel memory)...

SKL GT4e (i7-6770HQ)
--------------------

* 25-30% SynMark CSDof (fullscreen)
* 20-25% GpuTest windowed 1/2 screen Triangle, SynMark VSTangent
* 20-25% SynMark Fill* [1]
*    20% Unigine Heaven
*    20% GpuTest fullscreen Triangle, SynMark DeferredAA [2]
* 10-20% GLB 2.7 windowed 1/2 screen Egypt & T-Rex
* 10-20% SynMark ZBuffer & Deferred [2]
* 10-15% SynMark TexMem512, GPU write [1]
*  5-10% GfxBench T-Rex, Manhattan 3.0 & 3.1, CarChase, SynMark TerrainFly*
*     5% Unigine Valley, GfxBench AztecRuins
*   2-3%  8-bit, max FullHD, FFmpeg/MediaSDK GPU transcode/downscale

[1] With June user-space. With the latest Mesa, the drop in Fill* & write tests is only 3%, and TexMem512 perf somehow improves by 2%.  The latest Mesa is several percent faster than the older one in these tests due to the Mesa slice/subslice balance optimization; I have no idea how that can reduce the impact of IOMMU.

[2] With June user-space. With the latest user-space, the drop in these specific tests is half of that, or less.  For the fullscreen Triangle case, a potentially relevant user-space change could be the latest X server disabling atomic commits.  See: https://gitlab.freedesktop.org/xorg/xserver/issues/888


KBL GT3e (i7-7567U)
-------------------

* 30-35% SynMark Fill* [1]
* 20-25% GpuTest windowed 1/2 screen Triangle, SynMark VSTangent
* 10-15% GLB 2.7 windowed 1/2 screen Egypt & T-Rex, SynMark ZBuffer
* 10-15% GpuTest fullscreen Triangle, GPU write [1]
* 10-15% 4K HEVC GPU decode + hwdownload
*  5-10% Unigine Heaven, GfxBench T-Rex, Manhattan 3.0 & 3.1 & CarChase, SynMark CSDof
*  5-10% SynMark Deferred* [1]
*  5-10% 10-bit, 4K HEVC GPU transcode
*   2-3% 8-bit, max FullHD, FFmpeg/MediaSDK GPU transcode/downscale

[1] With June user-space. With latest user-space, drop in these specific tests is about half.


SKL GT2 (i5-6600K)
------------------

*   >25% GPU write, SynMark VSTangent
*    20% SynMark Fill*, GpuTest windowed Triangle
* 15-20% GLB Egypt
*    15% SynMark ZBuffer
*    10% GLB T-Rex, GpuTest fullscreen Triangle
*  5-10% GfxBench Manhattan 3.0 & 3.1, T-Rex, CarChase
*     5% GfxBench AztecRuins, Unigine Heaven
*   2-6% 8-bit, max FullHD, FFmpeg/MediaSDK GPU transcode/downscale

With the June user-space, there are some differences in how much performance drops, but nothing major like with GT3e & GT4e (where the slice/subslice balance issue had a noticeable impact).


BXT J4205
---------

Results are similar to the other devices (not reported here because BXT results have higher variance).  As with SKL GT2, the user-space version doesn't have a significant impact on how much IOMMU regresses performance.


BDW GT2
-------

As expected, no impact (kernel skips IOMMU for BDW).


Summary
-------

* IOMMU can cost up to a third of performance in the worst synthetic case, and 5-15% in real GPU (3D/Media) use-cases.

* It seems that badly balanced slice/subslice utilization can noticeably increase the IOMMU performance impact in some use-cases.

Comment 7 Eero Tamminen 2019-09-24 10:08:47 UTC

ICL-U with IOMMU
----------------

Performance didn't improve in any test from enabling IOMMU. The largest performance drops are the following:

*    30% SynMark VSTangent & ZBuffer
*    24% MemBW GPU write (with i965, 6-7% with Iris & higher FPS (*))
*    18% GLB 2.7 Egypt (windowed), SynMark VSDiffuse*
*    17% GpuTest v0.7 Triangle (windowed)
*    15% MemBW GPU blit (with i965, 20% with Iris & lower FPS (*))
*    14% GLB 2.7 T-Rex (windowed), SynMark HdrBloom
*  5-15% HEVC decode & download
* 11-12% GfxBench Manhattan 3.0 & 3.1, GpuTest Triangle, SynMark DrvRes
*  9-11% GfxBench T-Rex & CarChase, SynMark TerrainFly*
*   7-9% GfxBench AztecRuins, SynMark TexMem128, DeferredAA, ShMapVsm
*   5-7% Unigine Heaven & Valley, GfxBench ALU2, SynMark TexMem512, Deferred, TerrainPan*, CSCloth
*   2-6% 8-bit <FullHD Media transcode

(*) Iris doesn't support E2E RBC, so in onscreen tests it uses more bandwidth / frame, and therefore has lower perf.

PS. 30% perf drop from enabling IOMMU == >40% perf increase from disabling IOMMU.
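
The arithmetic behind that PS, as a small sketch (the function name is hypothetical): if enabling IOMMU leaves a fraction 1 - d of the original performance, going back to the original is a relative gain of d / (1 - d).

```python
def gain_from_undoing_drop(drop):
    """Relative gain needed to undo a relative drop; both are fractions (0.30 = 30%)."""
    if not 0 <= drop < 1:
        raise ValueError("drop must be in the range [0, 1)")
    return drop / (1.0 - drop)


# A 30% drop leaves 70% of the performance; 100 / 70 is about 1.429.
print(f"{gain_from_undoing_drop(0.30):.1%}")  # 42.9%
```

By the same formula, the "up to 1/3" drop in the bug title corresponds to a 50% gain from disabling IOMMU.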

Comment 8 Martin Peres 2019-11-29 19:32:10 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/430.