105952 – radv causes GPU hang on SI

Summary: radv causes GPU hang on SI

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Vulkan/radeon
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium major
Assignee:	mesa-dev
QA Contact:	mesa-dev

Reported:	2018-04-09 11:42 UTC by Turo Lamminen
Modified:	2018-04-10 19:59 UTC
CC List:	0 users

Attachments
dmesg output (51.40 KB, text/plain) 2018-04-09 11:43 UTC, Turo Lamminen
vulkaninfo output (102.28 KB, text/plain) 2018-04-09 11:43 UTC, Turo Lamminen
radv trace from the hang (174.79 KB, application/gzip) 2018-04-09 13:49 UTC, Turo Lamminen

Description Turo Lamminen 2018-04-09 11:42:31 UTC

The GPU hangs when running any Vulkan program. I tested with vulkan-smoketest, but it seems to happen with anything.

gmc_v6_0_process_interrupt: 28 callbacks suppressed
amdgpu 0000:01:00.0: GPU fault detected: 147 0x0f2a7001
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0F47FFF9
amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A070001
amdgpu 0000:01:00.0: VM fault (0x01, vmid 5) at page 256376825, read from '' (0x00000000) (112)
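
For readers comparing the fault lines in this report: the final "VM fault" line appears to be a decoded view of the two registers printed above it. The following is a minimal Python sketch of that decoding, not the driver's code. The bit positions and the 4 KiB page size are assumptions (they are not stated anywhere in this report), chosen because they reproduce the protections, vmid, page and trailing client id values that the kernel printed here. The name decode_vm_fault is hypothetical.

```python
# Assumed field layout of VM_CONTEXT1_PROTECTION_FAULT_STATUS on SI.
# These positions are NOT taken from this report; they are an assumption
# that reproduces the decoded "VM fault" lines printed by the kernel.
PROTECTIONS_MASK = 0xFF   # bits 0-7: protection fault bits
CLIENT_ID_SHIFT = 12      # bits 12-19: memory client id
CLIENT_ID_MASK = 0xFF
CLIENT_RW_SHIFT = 24      # bit 24: 0 = read, 1 = write
VMID_SHIFT = 25           # bits 25-28: VMID
VMID_MASK = 0xF
PAGE_SHIFT = 12           # assumes 4 KiB GPU pages


def decode_vm_fault(addr_reg: int, status_reg: int) -> dict:
    """Split the two fault registers into the fields shown in dmesg."""
    return {
        "protections": status_reg & PROTECTIONS_MASK,
        "vmid": (status_reg >> VMID_SHIFT) & VMID_MASK,
        # The FAULT_ADDR register is treated as a page number.
        "page": addr_reg,
        "byte_address": addr_reg << PAGE_SHIFT,
        "access": "write" if (status_reg >> CLIENT_RW_SHIFT) & 1 else "read",
        "client_id": (status_reg >> CLIENT_ID_SHIFT) & CLIENT_ID_MASK,
    }


if __name__ == "__main__":
    # Register values copied from the log above.
    fault = decode_vm_fault(0x0F47FFF9, 0x0A070001)
    print(
        "VM fault (0x{protections:02x}, vmid {vmid}) at page {page}, "
        "{access}, client id {client_id}, "
        "byte address 0x{byte_address:x}".format(**fault)
    )
```

Under these assumptions, the page number in the last log line is simply VM_CONTEXT1_PROTECTION_FAULT_ADDR written in decimal (0x0F47FFF9 is 256376825), so two faults can be compared by looking at the address register alone.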

Kernel 4.15.11 (current in Debian testing), LLVM 6.0.0, Pitcairn.

Bisected to:
4ad7595f350462c704fbe5b2bd2ca406c904e78e is the first bad commit
commit 4ad7595f350462c704fbe5b2bd2ca406c904e78e
Author: Samuel Pitoiset <samuel.pitoiset@gmail.com>
Date:   Wed Apr 4 12:12:03 2018 +0200

    radv: rename radv_emit_prefetch() to radv_emit_prefetch_L2()
    
    Signed-off-by: Samuel Pitoiset <samuel.pitoiset@gmail.com>
    Reviewed-by: Bas Nieuwenhuizen <bas@basnieuwenhuizen.nl>
    Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>


Despite the commit message, the commit seems to contain functional changes; in particular, it seems to enable some DMA transfers on all chips. It looks like that doesn't work on SI.

Comment 1 Turo Lamminen 2018-04-09 11:43:06 UTC

Created attachment 138699
dmesg output

Comment 2 Turo Lamminen 2018-04-09 11:43:36 UTC

Created attachment 138700
vulkaninfo output

Comment 3 Bas Nieuwenhuizen 2018-04-09 12:01:45 UTC

I think this should be fixed by

https://patchwork.freedesktop.org/patch/215092/

Comment 4 Samuel Pitoiset 2018-04-09 12:12:49 UTC

As Bas said, this should already be fixed. Sorry for the breakage.

Can you update your repo and confirm, please?

Comment 5 Turo Lamminen 2018-04-09 12:16:25 UTC

Still present in the latest commit (a055f5108dfb26522266095d9beb72857d2051f4)

[ 2862.614147] gmc_v6_0_process_interrupt: 28 callbacks suppressed
[ 2862.614150] amdgpu 0000:01:00.0: GPU fault detected: 147 0x0f227001
[ 2862.614155] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0F47FFF9
[ 2862.614157] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02070001
[ 2862.614159] amdgpu 0000:01:00.0: VM fault (0x01, vmid 1) at page 256376825, read from '' (0x00000000) (112)

Also, I don't see how that commit would fix it, since it refers to compute shaders and none of my test programs use those. Unless the commit message is misleading again.

Comment 6 Samuel Pitoiset 2018-04-09 12:38:34 UTC

Well, I made too many mistakes, sorry.

The following patch should fix the issue:
https://patchwork.freedesktop.org/patch/215920/

Comment 7 Turo Lamminen 2018-04-09 12:44:48 UTC

Well, the error message changed, but it still hangs...

[  110.666337] gmc_v6_0_process_interrupt: 28 callbacks suppressed
[  110.666340] amdgpu 0000:01:00.0: GPU fault detected: 146 0x028a8804
[  110.666344] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100014
[  110.666346] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A088004
[  110.666348] amdgpu 0000:01:00.0: VM fault (0x04, vmid 5) at page 1048596, read from '' (0x00000000) (136)

Comment 8 Turo Lamminen 2018-04-09 13:49:01 UTC

Created attachment 138703
radv trace from the hang

Comment 9 Turo Lamminen 2018-04-10 09:05:13 UTC

Still happens in 4381be4648b9ebb15b0a06885489998d5daac482

Comment 10 Turo Lamminen 2018-04-10 09:15:26 UTC

I did a little experiment: I rebased locally and removed the broken commit (4ad7595f350462c704fbe5b2bd2ca406c904e78e) and then the follow-ups (942fdfe357, f1d7c16e85, 04e609f1f8) because they no longer applied cleanly. The resulting Mesa works and does not exhibit this bug.

So there are no other confounding issues, and there's still some case in there which you've missed on SI.

Comment 11 Samuel Pitoiset 2018-04-10 19:59:19 UTC

Fixed.

https://cgit.freedesktop.org/~hakzsam/mesa/commit/?id=9f6a28eb27ca059cbadfa5e277bfe4509a426615
