Bug 105952 - radv causes GPU hang on SI
Summary: radv causes GPU hang on SI
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Vulkan/radeon
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: mesa-dev
QA Contact: mesa-dev
Reported: 2018-04-09 11:42 UTC by Turo Lamminen
Modified: 2018-04-10 19:59 UTC

Attachments
dmesg output (51.40 KB, text/plain)
2018-04-09 11:43 UTC, Turo Lamminen
vulkaninfo output (102.28 KB, text/plain)
2018-04-09 11:43 UTC, Turo Lamminen
radv trace from the hang (174.79 KB, application/gzip)
2018-04-09 13:49 UTC, Turo Lamminen

Description Turo Lamminen 2018-04-09 11:42:31 UTC
The GPU hangs when running any Vulkan program. I tested with vulkan-smoketest, but it seems to happen with anything.

gmc_v6_0_process_interrupt: 28 callbacks suppressed
amdgpu 0000:01:00.0: GPU fault detected: 147 0x0f2a7001
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0F47FFF9
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A070001
amdgpu 0000:01:00.0: VM fault (0x01, vmid 5) at page 256376825, read from '' (0x00000000) (112)
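
The final "VM fault" line appears to be a decoded view of the VM_CONTEXT1_PROTECTION_FAULT_STATUS value printed above it. The following is a minimal Python sketch of that decoding. The bit layout (protections in bits 0-7, memory client ID in bits 12-19, read/write flag in bit 24, VMID in bits 25-28) is an assumption inferred from the values logged in this report, not taken from hardware documentation, and the function name decode_fault_status is hypothetical.

```python
# Sketch only: the field layout below is inferred from the fault lines in
# this bug report and should be treated as an assumption.
def decode_fault_status(status: int) -> dict:
    """Split a VM_CONTEXT1_PROTECTION_FAULT_STATUS value into its fields."""
    return {
        "protections": status & 0xFF,          # bits 0-7
        "mc_id": (status >> 12) & 0xFF,        # bits 12-19, memory client ID
        "access": "write" if (status >> 24) & 0x1 else "read",  # bit 24
        "vmid": (status >> 25) & 0xF,          # bits 25-28
    }


# Status value from the fault report above.
fields = decode_fault_status(0x0A070001)
print(fields)
```

With this layout, 0x0A070001 yields protections 0x01, VMID 5, a read access and memory client ID 112, which matches the "VM fault (0x01, vmid 5) ... read from '' ... (112)" line.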

Kernel 4.15.11 (current in Debian testing), LLVM 6.0.0, Pitcairn.

Bisected to:
4ad7595f350462c704fbe5b2bd2ca406c904e78e is the first bad commit
commit 4ad7595f350462c704fbe5b2bd2ca406c904e78e
Author: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Date:   Wed Apr 4 12:12:03 2018 +0200

    radv: rename radv_emit_prefetch() to radv_emit_prefetch_L2()
    
    Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
    Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
    Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>


Despite the commit message, it seems to contain functional changes; in particular, it seems to enable some DMA transfers on all chips. It looks like that doesn't work on SI.
Comment 1 Turo Lamminen 2018-04-09 11:43:06 UTC
Created attachment 138699
dmesg output
Comment 2 Turo Lamminen 2018-04-09 11:43:36 UTC
Created attachment 138700
vulkaninfo output
Comment 3 Bas Nieuwenhuizen 2018-04-09 12:01:45 UTC
I think this should be fixed by

https://patchwork.freedesktop.org/patch/215092/
Comment 4 Samuel Pitoiset 2018-04-09 12:12:49 UTC
As Bas said, this should already be fixed. Sorry for the breakage.

Can you update your repo and confirm, please?
Comment 5 Turo Lamminen 2018-04-09 12:16:25 UTC
Still present in latest (a055f5108dfb26522266095d9beb72857d2051f4)

[ 2862.614147] gmc_v6_0_process_interrupt: 28 callbacks suppressed
[ 2862.614150] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0f227001
[ 2862.614155] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0F47FFF9
[ 2862.614157] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02070001
[ 2862.614159] amdgpu 0000:01:00.0: VM fault (0x01, vmid 1) at page 256376825, read from '' (0x00000000) (112)

Also, I don't see how that commit would fix it, since it refers to compute shaders and none of my test programs use those. Unless the commit message is misleading again.
Comment 6 Samuel Pitoiset 2018-04-09 12:38:34 UTC
Well, I made too many mistakes, sorry.

The following patch should fix the issue:
https://patchwork.freedesktop.org/patch/215920/
Comment 7 Turo Lamminen 2018-04-09 12:44:48 UTC
Well, the error message changed, but it still hangs...

[  110.666337] gmc_v6_0_process_interrupt: 28 callbacks suppressed
[  110.666340] amdgpu 0000:01:00.0: GPU fault detected: 146 0x028a8804
[  110.666344] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100014
[  110.666346] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A088004
[  110.666348] amdgpu 0000:01:00.0: VM fault (0x04, vmid 5) at page 1048596, read from '' (0x00000000) (136)
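
To compare fault reports like the one above across patches, the "VM fault" line can be parsed mechanically. The following Python sketch extracts the fields and converts the page number to a byte address. It assumes 4 KiB GPU pages; the names parse_vm_fault, VM_FAULT_RE and GPU_PAGE_SIZE are hypothetical and exist only for this illustration.

```python
import re

GPU_PAGE_SIZE = 4096  # assumption: the "page" in the log counts 4 KiB pages

# Matches the human-readable fault line, with or without a dmesg timestamp.
VM_FAULT_RE = re.compile(
    r"VM fault \(0x(?P<prot>[0-9a-fA-F]+), vmid (?P<vmid>\d+)\) "
    r"at page (?P<page>\d+), (?P<access>read|write) from"
)


def parse_vm_fault(line: str):
    """Return the decoded fields of a 'VM fault' line, or None if no match."""
    m = VM_FAULT_RE.search(line)
    if m is None:
        return None
    page = int(m.group("page"))
    return {
        "protections": int(m.group("prot"), 16),
        "vmid": int(m.group("vmid")),
        "page": page,
        "byte_address": page * GPU_PAGE_SIZE,
        "access": m.group("access"),
    }


line = ("[  110.666348] amdgpu 0000:01:00.0: VM fault (0x04, vmid 5) "
        "at page 1048596, read from '' (0x00000000) (136)")
fault = parse_vm_fault(line)
print(fault)
print(hex(fault["page"]))
```

The page number printed in decimal is the same value as VM_CONTEXT1_PROTECTION_FAULT_ADDR (1048596 is 0x00100014), so the faulting page moved from 0x0F47FFF9 in the earlier reports to 0x00100014 after the second patch.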
Comment 8 Turo Lamminen 2018-04-09 13:49:01 UTC
Created attachment 138703
radv trace from the hang
Comment 9 Turo Lamminen 2018-04-10 09:05:13 UTC
Still happens in 4381be4648b9ebb15b0a06885489998d5daac482
Comment 10 Turo Lamminen 2018-04-10 09:15:26 UTC
I did a little experiment: I rebased locally and removed the broken commit (4ad7595f350462c704fbe5b2bd2ca406c904e78e) and then the follow-ups (942fdfe357, f1d7c16e85, 04e609f1f8), because they no longer applied cleanly. The resulting Mesa works and does not exhibit this bug.

So there are no other confounding issues, and there's still some case in there which you've missed on SI.

