Bug 103100

Summary: Performance regression with various games in drm-next-amd-staging Kernel
Product: Mesa
Reporter: Gregor Münch <gr.muench>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact: Default DRI bug account <dri-devel>
Severity: normal
Priority: medium
CC: adf.lists, ckoenig.leichtzumerken, vedran
Version: git
Hardware: Other
OS: All
Attachments: Shadow of Mordor benchmark current amd-staging kernel
Shadow of Mordor benchmark older amd-staging kernel
Shadow of Mordor benchmark current amd-staging kernel shader cache

Description Gregor Münch 2017-10-04 16:38:21 UTC
I'm running the current drm-next-4.15-wip kernel and use AMDGPU with a Radeon HD 7970, DC disabled.

The following is wrong:
- Performance in the Shadow of Mordor internal benchmark decreases from 68 to 61 fps
- Other games also see a small decrease of 1-2 fps
- I see random screen corruption on my desktop
- After I exit from a game, the system is unstable, screen corruption is even more visible and the system randomly hangs

I bisected this to:
fd8bf087dffc0bce047c5aea2afcb8f821e48db1 is the first bad commit
commit fd8bf087dffc0bce047c5aea2afcb8f821e48db1
Author: Christian König <christian.koenig@amd.com>
Date:   Tue Aug 29 16:14:32 2017 +0200

    drm/amdgpu: bump version for support of local BOs
    
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 440c9b026e802e50b6a25ae3b402ea57ef58a891 d31d8e8b93060b11e88f95d4d3bdcf081c77e4e2 M	drivers

This probably doesn't make sense; I guess one of the previous commits related to BOs is faulty. To double-check, I used git checkout to switch between those commits and ran make clean at each step. It's still very unusual, but maybe a dev knows what's going on.

log:

amdgpu 0000:01:00.0: GPU fault detected: 146 0x030f3d14
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010CD18
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0F03D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 7) at page 1101080, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0f073d14
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010E178
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0703D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 3) at page 1106296, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0e0a3d0c
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010E670
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A03D00C
kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5) at page 1107568, read from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0e0a3d0c
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010E673
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A03D00C
kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5) at page 1107571, read from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0e0e440c
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00104670
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 7) at page 1066608, read from '' (0x00000000) (68)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0c0f3d14
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101960
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0F03D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 7) at page 1055072, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0e0b3d14
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001017F0
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B03D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 5) at page 1054704, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x02073d14
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00112F90
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0703D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 3) at page 1126288, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x08073d14
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00110E40
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0703D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 3) at page 1117760, write from '' (0x00000000) (61)
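
The "VM fault" lines in the log above appear to be a decoded form of the two register dumps that precede them. The following is a minimal sketch of that decoding, assuming a bit layout inferred only from the values in this log (low byte = protections, bits 12-19 = memory client id, bit 24 = write flag, bits 25-28 = vmid). The names decode_fault and decode_log are hypothetical helpers, not kernel or Mesa API.

```python
import re

# Matches the two register dump lines that precede each "VM fault" line.
FAULT_RE = re.compile(
    r"VM_CONTEXT1_PROTECTION_FAULT_(ADDR|STATUS)\s+0x([0-9A-Fa-f]{8})"
)


def decode_fault(addr, status):
    """Split the two registers into the fields the 'VM fault' line prints.

    The bit positions are an assumption inferred from the log in this report.
    """
    return {
        "page": addr,                        # the log prints this in decimal
        "protections": status & 0xFF,        # e.g. 0x14 or 0x0c
        "client_id": (status >> 12) & 0xFF,  # e.g. 61 or 68
        "write": bool((status >> 24) & 1),   # True = write, False = read
        "vmid": (status >> 25) & 0xF,        # e.g. 3, 5 or 7
    }


def decode_log(text):
    """Pair each ADDR register line with the STATUS line that follows it."""
    faults, addr = [], None
    for kind, value in FAULT_RE.findall(text):
        if kind == "ADDR":
            addr = int(value, 16)
        elif addr is not None:
            faults.append(decode_fault(addr, int(value, 16)))
            addr = None
    return faults


sample = """\
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0010CD18
kernel: amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0F03D014
"""
for fault in decode_log(sample):
    print(fault)
```

Run over the full log, this makes it easy to see that the faults are spread across vmids 3, 5 and 7 and are a mix of reads and writes.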
Comment 1 Nicolai Hähnle 2017-10-12 06:52:10 UTC
Thanks for the report. The bisection result makes perfect sense, as the version bump can change how Mesa behaves. Could you please provide the version of Mesa you're using? (The output of glxinfo contains this.)
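
To illustrate why a pure version bump can be the first bad commit: userspace can gate a feature on the DRM version the kernel reports, so the behaviour changes at the bump even though the code it enables landed earlier. The sketch below parses the DRM version from a glxinfo renderer string and applies such a gate. The (3, 20) threshold and the would_use_local_bos helper are assumptions for illustration, not Mesa's actual code.

```python
import re


def drm_version(renderer):
    """Extract (major, minor, patch) from a Mesa renderer string."""
    match = re.search(r"DRM (\d+)\.(\d+)\.(\d+)", renderer)
    if match is None:
        raise ValueError("no DRM version in renderer string")
    return tuple(int(part) for part in match.groups())


# Assumption: local BOs are advertised from DRM 3.20 onwards.
LOCAL_BO_MIN_VERSION = (3, 20)


def would_use_local_bos(renderer, minimum=LOCAL_BO_MIN_VERSION):
    """Hypothetical gate: enable the feature at or above `minimum`."""
    return drm_version(renderer)[:2] >= minimum


renderer = ("AMD Radeon HD 7900 Series (TAHITI / DRM 3.21.0 / "
            "4.14.0-2-drm-next-dc-git, LLVM 6.0.0)")
print(drm_version(renderer), would_use_local_bos(renderer))
```

With a gate like this, every kernel commit before the bump behaves like the old kernel from userspace's point of view, which is why the bisection stops exactly at the bump.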
Comment 2 Gregor Münch 2017-10-15 18:57:54 UTC
I tested again with yesterday's git and a newer kernel:
OpenGL renderer string: AMD Radeon HD 7900 Series (TAHITI / DRM 3.21.0 / 4.14.0-2-drm-next-dc-git, LLVM 6.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.3.0-devel (git-0c1aecf177)

The system doesn't hang anymore; it completes Phoronix benchmark runs again. Before, it always crashed when the 3rd game test was started. The benchmark suite runs each game test 3 times, so it usually crashed after 7-8 starts.
However, I still see those GPU faults in the log.
Performance in Shadow of Mordor is still just 61-62 fps.
Comment 3 Gregor Münch 2017-11-21 20:35:19 UTC
I did a rather lengthy test with some games:
https://openbenchmarking.org/result/1711211-AL-GAMETEST345

To conclude, it is slower for every game I tested.
Dota 2 Vulkan and Unigine Superposition show some of the larger drops.

This correlates with findings from Phoronix, by the way:
https://www.phoronix.com/scan.php?page=news_item&px=AMDGPU-DRM-4.15-Early
Comment 4 Michel Dänzer 2017-11-22 10:40:11 UTC
One possible issue with per-VM BOs is that the kernel driver can no longer migrate BOs to more optimal placement for a command stream, because it doesn't know which BOs are used by the command stream. So if e.g. a per-VM BO is evicted from VRAM, e.g. due to CPU access to it, it will normally never move back to VRAM.

Christian, might it be possible to e.g. maintain a list of per-VM BOs which were evicted from VRAM, and try to move them back to VRAM as part of the existing mechanism for this (for "normal" BOs)?
Comment 5 Andy Furniss 2017-11-22 11:42:55 UTC
*** Bug 103175 has been marked as a duplicate of this bug. ***
Comment 6 Christian König 2017-11-22 11:56:41 UTC
(In reply to Michel Dänzer from comment #4)
> Christian, might it be possible to e.g. maintain a list of per-VM BOs which
> were evicted from VRAM, and try to move them back to VRAM as part of the
> existing mechanism for this (for "normal" BOs)?

That would certainly be possible, but I don't think it would help in any way.

The kernel simply doesn't know any more which BOs are currently used and which aren't. So as soon as userspace allocates more VRAM than is physically available, we are practically lost.

In other words we would just cycle over a list of BOs evicted from VRAM on every submission and would rarely be able to move something back in.

What we could do is try to move buffers back into VRAM when memory is freed, but that happens so rarely as well that it probably doesn't make much sense either.

Can somebody analyze exactly why those games are now slower than they have been before? E.g. which buffers are fighting for VRAM? Or are they maybe fighting for GTT?
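
The argument above can be sketched with a toy model, which is not the kernel's actual algorithm. It assumes that evicted per-VM BOs can only move back into free VRAM, since without per-submission usage information there is no way to pick a safe victim to evict in exchange. All names and sizes here are hypothetical.

```python
def retry_move_back(capacity, resident, evicted):
    """Try to move evicted BOs back into VRAM without evicting anything else.

    `resident` and `evicted` map a BO name to its size and are modified in
    place. Returns the names of the BOs that were moved back.
    """
    free = capacity - sum(resident.values())
    moved = []
    for name, size in list(evicted.items()):
        if size <= free:
            resident[name] = evicted.pop(name)
            free -= size
            moved.append(name)
    return moved


def simulate(capacity, resident, evicted, submissions):
    """Retry on every submission; count the attempts and the successes."""
    attempts = successes = 0
    for _ in range(submissions):
        attempts += len(evicted)
        successes += len(retry_move_back(capacity, resident, evicted))
    return attempts, successes


# VRAM is full: every submission walks the evicted list and moves nothing.
print(simulate(100, {"a": 60, "b": 40}, {"c": 30}, submissions=1000))

# VRAM has room: the BO moves back on the first try and the list empties.
print(simulate(100, {"a": 60}, {"c": 30}, submissions=1000))
```

In the oversubscribed case the model spends one attempt per evicted BO on every submission and never succeeds, which is the cycling described above. Moving back only helps once something frees VRAM.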
Comment 7 Gregor Münch 2017-11-22 20:58:27 UTC
Created attachment 135672 [details]
Shadow of Mordor benchmark current amd-staging kernel

This is the current situation...
I don't know if this helps.
Comment 8 Gregor Münch 2017-11-22 20:59:42 UTC
Created attachment 135673 [details]
Shadow of Mordor benchmark older amd-staging kernel
Comment 9 Gregor Münch 2017-11-22 21:02:09 UTC
Created attachment 135674 [details]
Shadow of Mordor benchmark current amd-staging kernel shader cache

This is the situation after the shader cache kicked in. It doesn't seem to be any different.
Comment 10 Gregor Münch 2017-12-02 19:32:50 UTC
I've updated the test with the staging kernel and Mesa from today:
https://openbenchmarking.org/result/1712028-AL-GAMETEST322

Looks like I enabled the performance governor by accident. But overall 4.11 is still faster, so it looks like there were some performance improvements in Mesa in the last days.
Comment 11 Andy Furniss 2017-12-12 13:02:23 UTC
For my test case (UnrealTournament alpha with not quite enough VRAM), it looks like Mesa commit

winsys/amdgpu: disable local BOs again due to worse performance
https://cgit.freedesktop.org/mesa/mesa/commit/?id=bf0904e31fb7d9cd8932d582076c8d7beb02ba89

works around the issue.
Comment 12 Gregor Münch 2017-12-12 19:51:05 UTC
I've tested Shadow of Mordor, and something in the last two weeks made it faster: either the kernel, Mesa or LLVM. Basically it went from 61/63 fps to 66 fps. With Mesa from today, performance is restored to 68 fps.

I've also run my test suite again:
https://openbenchmarking.org/result/1712125-AL-GAMETEST346

Unigine Superposition especially saw a great benefit and went from 40 to 49 fps.
Dota 2 Vulkan is still in a regressed state.

I also made a trace file of Shadow of Mordor with the older kernel and good performance some days ago: https://uploadfiles.io/ktjmx
I still struggle to compress the trace with the "bad" performance; if you are interested, I'll try to provide it.

Marking resolved for now.
