I'm running the current drm-next-4.15-wip kernel and I use AMDGPU with a Radeon HD 7970, DC disabled.

The following is wrong:
- Performance in the Shadow of Mordor internal benchmark decreases from 68 to 61 fps
- Other games also see a small decrease of 1-2 fps
- I see random screen corruption on my desktop
- After I exit a game, the system is unstable, the screen corruption is even more visible, and the system randomly hangs

I bisected this to:

fd8bf087dffc0bce047c5aea2afcb8f821e48db1 is the first bad commit
commit fd8bf087dffc0bce047c5aea2afcb8f821e48db1
Author: Christian König <christian.koenig@amd.com>
Date: Tue Aug 29 16:14:32 2017 +0200

    drm/amdgpu: bump version for support of local BOs

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 440c9b026e802e50b6a25ae3b402ea57ef58a891 d31d8e8b93060b11e88f95d4d3bdcf081c77e4e2 M drivers

This result probably doesn't make sense on its own; I guess one of the previous commits related to BOs is faulty. To double-check, I used git checkout to switch between those commits and ran make clean between the steps. It's still very unusual, but maybe a dev knows what's going on.
log:
amdgpu 0000:01:00.0: GPU fault detected: 146 0x030f3d14
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010CD18
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0F03D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 7) at page 1101080, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0f073d14
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010E178
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0703D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 3) at page 1106296, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0e0a3d0c
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010E670
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A03D00C
kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5) at page 1107568, read from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0e0a3d0c
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010E673
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A03D00C
kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5) at page 1107571, read from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0e0e440c
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00104670
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E04400C
kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 7) at page 1066608, read from '' (0x00000000) (68)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0c0f3d14
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00101960
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0F03D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 7) at page 1055072, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0e0b3d14
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001017F0
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0B03D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 5) at page 1054704, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x02073d14
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00112F90
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0703D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 3) at page 1126288, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x08073d14
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x00110E40
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0703D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 3) at page 1117760, write from '' (0x00000000) (61)
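For anyone triaging a longer log of this kind, the fault entries above can be summarized with a small script. This is only a sketch: the function names (`summarize_faults`, `fault_addresses`) are made up for illustration, and the regular expressions assume the exact message wording shown in this report, which may differ between kernel versions. It counts faults per VMID and access type, and shows that in these entries the decimal page number matches the hexadecimal VM_CONTEXT1_PROTECTION_FAULT_ADDR value.

```python
import re
from collections import Counter

# Matches the summary line, e.g.
# "VM fault (0x14, vmid 7) at page 1101080, write from '' (0x00000000) (61)"
FAULT_RE = re.compile(
    r"VM fault \(0x[0-9a-fA-F]+, vmid (\d+)\) at page (\d+), (read|write) from"
)
# Matches the register dump line that precedes each summary line.
ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9A-Fa-f]+)")


def summarize_faults(log_text):
    """Count faults per (vmid, access) pair and list the faulting pages."""
    counts = Counter()
    pages = []
    for vmid, page, access in FAULT_RE.findall(log_text):
        counts[(int(vmid), access)] += 1
        pages.append(int(page))
    return counts, pages


def fault_addresses(log_text):
    """Return the FAULT_ADDR register values as integers."""
    return [int(value, 16) for value in ADDR_RE.findall(log_text)]


# Two entries copied from the log in this report.
SAMPLE = """\
amdgpu 0000:01:00.0: GPU fault detected: 146 0x030f3d14
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010CD18
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0F03D014
kernel: amdgpu 0000:01:00.0: VM fault (0x14, vmid 7) at page 1101080, write from '' (0x00000000) (61)
kernel: amdgpu 0000:01:00.0: GPU fault detected: 146 0x0e0a3d0c
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x0010E670
kernel: amdgpu 0000:01:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A03D00C
kernel: amdgpu 0000:01:00.0: VM fault (0x0c, vmid 5) at page 1107568, read from '' (0x00000000) (61)
"""

counts, pages = summarize_faults(SAMPLE)
for (vmid, access), n in sorted(counts.items()):
    print(f"vmid {vmid}: {n} {access} fault(s)")
```

Because the patterns are applied with `findall` over the whole text, the script works whether the log has one entry per line or has been pasted as a single run of text.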
Thanks for the report. The bisection result makes perfect sense, as the version bump can change how Mesa behaves. Could you please provide the version of Mesa you're using? (The output of glxinfo contains this.)
I tested again with yesterday's git and a newer kernel:

OpenGL renderer string: AMD Radeon HD 7900 Series (TAHITI / DRM 3.21.0 / 4.14.0-2-drm-next-dc-git, LLVM 6.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 17.3.0-devel (git-0c1aecf177)

The system doesn't hang anymore; it completes Phoronix benchmark runs again. Before, it always crashed when the third game test was started. The benchmark suite runs each game test 3 times, so it usually crashed after 7-8 starts. However, I still see those GPU faults in the log. Performance in Shadow of Mordor is still just 61-62 fps.
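Since the behaviour depends on both the DRM version reported by the kernel and the Mesa build, it can help to pull those fields out of the glxinfo output when comparing setups. The following is a sketch only: `parse_glxinfo` is a hypothetical helper, and the patterns assume the string layout shown in this comment, which other drivers or Mesa versions may not follow.

```python
import re

# Assumed layout: "(CHIP / DRM a.b.c / kernel-release, LLVM x.y.z)"
RENDERER_RE = re.compile(
    r"\((\w+) / DRM (\d+)\.(\d+)\.(\d+) / ([^,]+), LLVM ([\d.]+)\)"
)
# Assumed layout: "Mesa <version>" optionally followed by " (git-<hash>)"
MESA_RE = re.compile(r"Mesa (\S+)(?: \(git-([0-9a-f]+)\))?")


def parse_glxinfo(text):
    """Extract chip, DRM version, kernel, LLVM and Mesa build from glxinfo text."""
    info = {}
    renderer = RENDERER_RE.search(text)
    if renderer:
        chip, major, minor, patch, kernel, llvm = renderer.groups()
        info["chip"] = chip
        info["drm"] = (int(major), int(minor), int(patch))
        info["kernel"] = kernel
        info["llvm"] = llvm
    mesa = MESA_RE.search(text)
    if mesa:
        info["mesa"] = mesa.group(1)
        info["mesa_git"] = mesa.group(2)  # None for release builds
    return info


# The two lines quoted in this comment.
SAMPLE = (
    "OpenGL renderer string: AMD Radeon HD 7900 Series "
    "(TAHITI / DRM 3.21.0 / 4.14.0-2-drm-next-dc-git, LLVM 6.0.0)\n"
    "OpenGL core profile version string: 4.5 (Core Profile) "
    "Mesa 17.3.0-devel (git-0c1aecf177)\n"
)

print(parse_glxinfo(SAMPLE))
```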
I did a rather lengthy test with some games: https://openbenchmarking.org/result/1711211-AL-GAMETEST345 To conclude, it is slower for every game I tested. Dota 2 Vulkan and Unigine Superposition show some of the larger drops. This matches the findings from Phoronix, by the way: https://www.phoronix.com/scan.php?page=news_item&px=AMDGPU-DRM-4.15-Early
One possible issue with per-VM BOs is that the kernel driver can no longer migrate BOs to more optimal placement for a command stream, because it doesn't know which BOs are used by the command stream. So if e.g. a per-VM BO is evicted from VRAM, e.g. due to CPU access to it, it will normally never move back to VRAM. Christian, might it be possible to e.g. maintain a list of per-VM BOs which were evicted from VRAM, and try to move them back to VRAM as part of the existing mechanism for this (for "normal" BOs)?
*** Bug 103175 has been marked as a duplicate of this bug. ***
(In reply to Michel Dänzer from comment #4)
> Christian, might it be possible to e.g. maintain a list of per-VM BOs which
> were evicted from VRAM, and try to move them back to VRAM as part of the
> existing mechanism for this (for "normal" BOs)?

That would certainly be possible, but I don't think it would help in any way. The kernel simply doesn't know any more which BOs are currently used and which aren't. So as soon as userspace allocates more VRAM than is physically available, we are practically lost. In other words, we would just cycle over a list of BOs evicted from VRAM on every submission and would rarely be able to move something back in.

What we could do is try to move buffers back into VRAM when memory is freed, but that happens so rarely as well that it probably doesn't make much sense either.

Can somebody analyze exactly why those games are now slower than they were before? E.g. which buffers are fighting for VRAM? Or are they maybe fighting for GTT?
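The concern about cycling over an evicted list on every submission can be illustrated with a toy model. This is not kernel code and does not reflect how amdgpu or TTM actually track buffers; the functions, buffer sizes and free-space figures are all hypothetical and chosen only to show the shape of the argument. Each submission walks the list of evicted per-VM BOs and moves one back only if it fits in the free VRAM.

```python
def retry_evicted(evicted, vram_free):
    """Walk the evicted list once and move back every BO that fits.

    Returns (still_evicted, vram_free, moved, visited).
    """
    still_evicted = []
    moved = 0
    for size in evicted:
        if size <= vram_free:
            vram_free -= size  # BO fits: it goes back into VRAM
            moved += 1
        else:
            still_evicted.append(size)  # no room: stays evicted
    return still_evicted, vram_free, moved, len(evicted)


def simulate(submissions, evicted, vram_free):
    """Total list entries visited and BOs moved over many submissions."""
    visited = moved = 0
    for _ in range(submissions):
        evicted, vram_free, m, v = retry_evicted(evicted, vram_free)
        moved += m
        visited += v
    return visited, moved


# Hypothetical numbers: ten evicted 64 MiB BOs, VRAM oversubscribed.
print(simulate(100, [64] * 10, vram_free=16))  # no BO ever fits
print(simulate(100, [64] * 10, vram_free=64))  # exactly one BO fits
```

In this model, once VRAM is oversubscribed and nothing is freed, almost every visit to the list is wasted work, which is the situation described in the comment above.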
Created attachment 135672 [details] Shadow of Mordor benchmark current amd-staging kernel This is the current situation... I don't know if this helps.
Created attachment 135673 [details] Shadow of Mordor benchmark older amd-staging kernel
Created attachment 135674 [details] Shadow of Mordor benchmark current amd-staging kernel shader cache This is the situation after the shader cache kicked in. Seems to be not any different.
I've updated the test with the staging kernel and Mesa from today: https://openbenchmarking.org/result/1712028-AL-GAMETEST322 It looks like I enabled the performance governor by accident. Overall 4.11 is still faster, but it looks like there were some performance improvements in Mesa in the last few days.
For my test case (UnrealTournament alpha with not quite enough VRAM), it looks like the Mesa commit "winsys/amdgpu: disable local BOs again due to worse performance" https://cgit.freedesktop.org/mesa/mesa/commit/?id=bf0904e31fb7d9cd8932d582076c8d7beb02ba89 works around the issue.
I've tested Shadow of Mordor and something in the last two weeks made it faster, either the kernel, Mesa or LLVM. Basically it went from 61/63 fps to 66 fps. With Mesa from today, performance is restored to 68 fps. I've also run my test suite again: https://openbenchmarking.org/result/1712125-AL-GAMETEST346 Unigine Superposition in particular benefited greatly and went from 40 to 49 fps. Dota 2 Vulkan is still in a regressed state. I also made a trace file of Shadow of Mordor with the older kernel and good performance some days ago: https://uploadfiles.io/ktjmx I'm still struggling to compress the trace with the "bad" performance; if you are interested, I will try to provide it. Marking resolved for now.