Bug 109649

Summary:	[bisected][raven] gfx ring timeout when running clover apps
Product:	DRI	Reporter:	Jan Vesely <jv356>
Component:	DRM/AMDgpu	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED MOVED	QA Contact:
Severity:	normal
Priority:	medium	CC:	christian.koenig
Version:	unspecified
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:

Description Jan Vesely 2019-02-15 23:02:20 UTC

This is a regression in 4.20.x; the same userspace works fine on 4.19.
I could bisect, but this is my main machine, so I can't dedicate much time to it. Any hint would be appreciated.
The kernel is booted with iommu=soft. Full IOMMU hangs on boot, and running with no IOMMU disables the Wi-Fi.

Dmesg:
> [  702.207054] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1340, emitted seq=1342
> [  702.207061] [drm] GPU recovery disabled.
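
For triage, the gap between the signaled and emitted sequence numbers in the timeout line indicates how many fences were still outstanding on the ring. A minimal Python sketch, assuming the message format shown in the single sample above (the regex and the function name are illustrative and not part of any amdgpu tooling):

```python
import re

# Pattern for the amdgpu ring timeout message quoted above. The format is
# inferred from this one sample and may differ between kernel versions.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, outstanding), or None if no match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    return m.group("ring"), signaled, emitted, emitted - signaled

line = ("[  702.207054] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
        "ring gfx timeout, signaled seq=1340, emitted seq=1342")
print(parse_ring_timeout(line))  # ('gfx', 1340, 1342, 2)
```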

lspci -nn:
05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:15dd] (rev c4)

It's a ThinkPad E485 laptop with:
AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx (family: 0x17, model: 0x11, stepping: 0x0)

Comment 1 Jan Vesely 2019-02-28 09:25:17 UTC

Bisection shows that the first bad commit is:
commit 09b6f25b55d9c66af7302e1f09ad90aa5b1dfbcb (HEAD, refs/bisect/bad)
Author: Christian König <christian.koenig@amd.com>
Date:   Wed Aug 15 14:04:47 2018 +0200

    drm/amdgpu: fix VM size reporting on Raven
    
    Raven doesn't have an VCE block and so also no buggy VCE firmware.
    
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Reviewed-by: Huang Rui <ray.huang@amd.com>
    Acked-by: Chunming Zhou <david1.zhou@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

I guess there is some other buggy firmware or limitation?

# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info 
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 40, firmware version: 0x00000099
PFP feature version: 40, firmware version: 0x000000ae
CE feature version: 40, firmware version: 0x0000004d
RLC feature version: 1, firmware version: 0x0000d237
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
MEC feature version: 40, firmware version: 0x0000018b
MEC2 feature version: 40, firmware version: 0x0000018b
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x0017ba78
SMC feature version: 0, firmware version: 0x00001e49
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x01004912
VBIOS version: 113-RAVEN-106

Comment 2 Jan Vesely 2019-02-28 09:36:25 UTC

I've confirmed that reverting the change on top of 4.20.13 fixes the issue.

Comment 3 Jan Vesely 2019-02-28 16:03:37 UTC

The bug is still present in 5.0.0-rc8.

Comment 4 Jan Vesely 2019-03-08 02:51:39 UTC

The issue appears fixed with new firmware, but now the laptop won't suspend.

# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 40, firmware version: 0x00000099
PFP feature version: 40, firmware version: 0x000000ae
CE feature version: 40, firmware version: 0x0000004d
RLC feature version: 1, firmware version: 0x0000d237
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
MEC feature version: 40, firmware version: 0x0000018b
MEC2 feature version: 40, firmware version: 0x0000018b
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x0017ba78
SMC feature version: 0, firmware version: 0x00001e49
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x01004912
DMCU feature version: 0, firmware version: 0x00000001
VBIOS version: 113-RAVEN-106
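
Comparing this dump against the one in comment 1 by eye is error-prone. A rough Python sketch of a field-by-field comparison follows; the function names are hypothetical, and the two sample strings are short excerpts of the dumps in this report rather than the full output:

```python
import re

# Matches lines such as "RLC SRLC feature version: 1, firmware version: 0x00000001".
# Lines that do not fit (e.g. "VBIOS version: ...") are skipped in this sketch.
LINE_RE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)

def parse_firmware_info(text):
    """Map block name -> (feature version, firmware version)."""
    info = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            info[m.group("name")] = (int(m.group("feature")), int(m.group("fw"), 16))
    return info

def diff_firmware_info(old, new):
    """Return {name: (old_value, new_value)} for entries that differ."""
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes

# Excerpts only: the last lines of the dumps from comments 1 and 4.
old_dump = """\
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x01004912
"""
new_dump = """\
SDMA0 feature version: 41, firmware version: 0x000000a9
VCN feature version: 0, firmware version: 0x01004912
DMCU feature version: 0, firmware version: 0x00000001
"""

print(diff_firmware_info(parse_firmware_info(old_dump),
                         parse_firmware_info(new_dump)))
# {'DMCU': (None, (0, 1))}
```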

Comment 5 Jan Vesely 2019-03-08 04:01:47 UTC

Since the debugfs output does not show a firmware version difference, here's the change in the firmware files:
$ diff old_fw new_fw 
8,9c8
- e2ddb912bf242e3b1b4219b36a19bff7  /lib/firmware/amdgpu/raven2_rlc.bin
- 27168d5b60ef396926a2aa0e2da00a97  /lib/firmware/amdgpu/raven2_sdma1.bin
---
+ 4ac07f88b9c4aa4fe026be87cb16ceda  /lib/firmware/amdgpu/raven2_rlc.bin
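
The same comparison can be made mechanically from two md5sum-style listings. The Python sketch below assumes each line has the form "<digest>  <path>"; the helper names are made up for illustration, and the sample data is just the three lines from the diff above:

```python
def parse_md5sum(text):
    """Map path -> digest from md5sum-style lines ('<digest>  <path>')."""
    listing = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, path = parts
            listing[path.strip()] = digest
    return listing

def compare_listings(old, new):
    """Classify paths as changed, removed, or added between two listings."""
    return {
        "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
        "removed": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
    }

old_fw = """\
e2ddb912bf242e3b1b4219b36a19bff7  /lib/firmware/amdgpu/raven2_rlc.bin
27168d5b60ef396926a2aa0e2da00a97  /lib/firmware/amdgpu/raven2_sdma1.bin
"""
new_fw = """\
4ac07f88b9c4aa4fe026be87cb16ceda  /lib/firmware/amdgpu/raven2_rlc.bin
"""

result = compare_listings(parse_md5sum(old_fw), parse_md5sum(new_fw))
print(result["changed"])  # ['/lib/firmware/amdgpu/raven2_rlc.bin']
print(result["removed"])  # ['/lib/firmware/amdgpu/raven2_sdma1.bin']
```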


(In reply to Jan Vesely from comment #4)
> The issue appears fixed with new firmware, but now the laptop won't suspend.

The same workaround as before fixes the suspend/resume issue.

drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:709
+                      vm_size = min(vm_size, 1ULL << 40);
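
For scale, 1ULL << 40 is 2^40, so the workaround caps the reported VM size at 1 TiB if vm_size is a byte count (an assumption here). A small Python sketch of the arithmetic; the 2^48 and 2^36 inputs are hypothetical values chosen for illustration, not measurements from this machine:

```python
VM_SIZE_CAP = 1 << 40  # same value as the C expression 1ULL << 40
TIB = 1 << 40          # bytes per TiB

def clamp_vm_size(vm_size):
    """Mirror of the one-line workaround: vm_size = min(vm_size, 1ULL << 40)."""
    return min(vm_size, VM_SIZE_CAP)

print(VM_SIZE_CAP // TIB)                  # 1
print(clamp_vm_size(1 << 48) // TIB)       # 1 (a larger size is capped)
print(clamp_vm_size(1 << 36) == 1 << 36)   # True (a smaller size is unchanged)
```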

Comment 6 Jan Vesely 2019-03-11 03:41:57 UTC

I managed to get the IOMMU working by passing "amd_iommu=pt ivrs_ioapic[32]=00:14.0" on the kernel command line.
Now it's back to square one.
All clover kernels hang the GPU unless I limit the VM size with 'vm_size = min(vm_size, 1ULL << 40);'.
Otherwise the machine works (including 3D graphics and suspend/resume).

Comment 7 Jan Vesely 2019-05-07 06:01:54 UTC

The workaround is still necessary in kernel 5.1.0.
The failure mode is a bit different: it hangs just the application, not the entire machine.

Comment 8 Martin Peres 2019-11-19 09:13:31 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/698.
