Summary: | [bisected][raven] gfx ring timeout when running clover apps | | |
---|---|---|---|
Product: | DRI | Reporter: | Jan Vesely <jv356> |
Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | normal | | |
Priority: | medium | CC: | christian.koenig |
Version: | unspecified | | |
Hardware: | x86-64 (AMD64) | | |
OS: | Linux (All) | | |
Bisection shows that the first bad commit is:

    commit 09b6f25b55d9c66af7302e1f09ad90aa5b1dfbcb (HEAD, refs/bisect/bad)
    Author: Christian König <christian.koenig@amd.com>
    Date:   Wed Aug 15 14:04:47 2018 +0200

        drm/amdgpu: fix VM size reporting on Raven

        Raven doesn't have an VCE block and so also no buggy VCE firmware.

        Signed-off-by: Christian König <christian.koenig@amd.com>
        Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
        Reviewed-by: Huang Rui <ray.huang@amd.com>
        Acked-by: Chunming Zhou <david1.zhou@amd.com>
        Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

I guess there is other buggy firmware/limitation?

    # cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
    VCE feature version: 0, firmware version: 0x00000000
    UVD feature version: 0, firmware version: 0x00000000
    MC feature version: 0, firmware version: 0x00000000
    ME feature version: 40, firmware version: 0x00000099
    PFP feature version: 40, firmware version: 0x000000ae
    CE feature version: 40, firmware version: 0x0000004d
    RLC feature version: 1, firmware version: 0x0000d237
    RLC SRLC feature version: 1, firmware version: 0x00000001
    RLC SRLG feature version: 1, firmware version: 0x00000001
    RLC SRLS feature version: 1, firmware version: 0x00000001
    MEC feature version: 40, firmware version: 0x0000018b
    MEC2 feature version: 40, firmware version: 0x0000018b
    SOS feature version: 0, firmware version: 0x00000000
    ASD feature version: 0, firmware version: 0x0017ba78
    SMC feature version: 0, firmware version: 0x00001e49
    SDMA0 feature version: 41, firmware version: 0x000000a9
    VCN feature version: 0, firmware version: 0x01004912
    VBIOS version: 113-RAVEN-106

I've confirmed that reverting the change on top of 4.20.13 fixes the issue.

The bug is still present in 5.0.0-rc8.

The issue appears fixed with new firmware, but now the laptop won't suspend.

    # cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
    VCE feature version: 0, firmware version: 0x00000000
    UVD feature version: 0, firmware version: 0x00000000
    MC feature version: 0, firmware version: 0x00000000
    ME feature version: 40, firmware version: 0x00000099
    PFP feature version: 40, firmware version: 0x000000ae
    CE feature version: 40, firmware version: 0x0000004d
    RLC feature version: 1, firmware version: 0x0000d237
    RLC SRLC feature version: 1, firmware version: 0x00000001
    RLC SRLG feature version: 1, firmware version: 0x00000001
    RLC SRLS feature version: 1, firmware version: 0x00000001
    MEC feature version: 40, firmware version: 0x0000018b
    MEC2 feature version: 40, firmware version: 0x0000018b
    SOS feature version: 0, firmware version: 0x00000000
    ASD feature version: 0, firmware version: 0x0017ba78
    SMC feature version: 0, firmware version: 0x00001e49
    SDMA0 feature version: 41, firmware version: 0x000000a9
    VCN feature version: 0, firmware version: 0x01004912
    DMCU feature version: 0, firmware version: 0x00000001
    VBIOS version: 113-RAVEN-106

Since the debugfs output does not show a firmware difference, here's the change in files:

    $ diff old_fw new_fw
    8,9c8
    - e2ddb912bf242e3b1b4219b36a19bff7 /lib/firmware/amdgpu/raven2_rlc.bin
    - 27168d5b60ef396926a2aa0e2da00a97 /lib/firmware/amdgpu/raven2_sdma1.bin
    ---
    + 4ac07f88b9c4aa4fe026be87cb16ceda /lib/firmware/amdgpu/raven2_rlc.bin

(In reply to Jan Vesely from comment #4)
> The issue appears fixed with new firmware, but now the laptop won't suspend.

The same workaround as before fixes the suspend/resume issue.

    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c:709
    + vm_size = min(vm_size, 1ULL << 40);

I managed to get IOMMU working by passing "amd_iommu=pt ivrs_ioapic[32]=00:14.0" on the kernel command line.

Now it's back to square one. All clover kernels hang the GPU unless I limit the VM size with 'vm_size = min(vm_size, 1ULL << 40);'. Otherwise the machine works (including 3D graphics and suspend/resume).

The workaround is still necessary in kernel 5.1.0.
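
For readers unfamiliar with the arithmetic in the workaround quoted above, here is a small Python model of the clamp (the kernel code itself is C). It assumes `vm_size` is a byte count, in which case `1ULL << 40` is 1 TiB. The function name `clamp_vm_size` and the sample input sizes are hypothetical and are not taken from this report or from the driver.

```python
# Illustrative model of `vm_size = min(vm_size, 1ULL << 40);`.
# Assumption: vm_size is expressed in bytes, so the limit is 1 TiB.
TIB = 1 << 40  # same value as 1ULL << 40 in C


def clamp_vm_size(vm_size: int, limit: int = TIB) -> int:
    """Return vm_size capped at `limit`, mirroring the one-line workaround."""
    return min(vm_size, limit)


# Hypothetical reported sizes (powers of two chosen for readability only).
for reported in (1 << 36, 1 << 40, 1 << 48):
    clamped = clamp_vm_size(reported)
    print(f"reported=2^{reported.bit_length() - 1} "
          f"clamped=2^{clamped.bit_length() - 1}")
```

Sizes at or below the limit pass through unchanged; anything larger is reported as exactly 2^40 bytes, which is what a userspace client such as clover would then see.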
The failure mode is a bit different: it hangs just the application, not the entire machine.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/698.
This is a regression in 4.20.x; the same userspace works OK on 4.19. I could bisect, but it's my main machine so I can't quite dedicate the time. Any hint would be appreciated.

The kernel is booted using iommu=soft. Full iommu hangs on boot, and noimmu disables the wi-fi.

Dmesg:
> [ 702.207054] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=1340, emitted seq=1342
> [ 702.207061] [drm] GPU recovery disabled.

lspci -nn:
05:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Device [1002:15dd] (rev c4)

It's a thinkpad e485 laptop with:
AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx (family: 0x17, model: 0x11, stepping: 0x0)
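
When comparing several such hangs, it can help to pull the sequence numbers out of the timeout line mechanically. The sketch below is a minimal Python example that parses the dmesg line quoted above. The helper name `parse_ring_timeout` and the `pending` field are illustrative, and reading "emitted minus signaled" as the number of fences still outstanding is an assumption about what the message means, not something stated in this report.

```python
import re

# The timeout line from the dmesg excerpt in this report.
LINE = ("[ 702.207054] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
        "ring gfx timeout, signaled seq=1340, emitted seq=1342")

# Matches the ring name and the two sequence numbers.
PATTERN = re.compile(
    r"ring (\S+) timeout, signaled seq=(\d+), emitted seq=(\d+)"
)


def parse_ring_timeout(line):
    """Return ring name and sequence numbers, or None if the line differs."""
    match = PATTERN.search(line)
    if match is None:
        return None
    signaled = int(match.group(2))
    emitted = int(match.group(3))
    return {
        "ring": match.group(1),
        "signaled": signaled,
        "emitted": emitted,
        # Assumed interpretation: work submitted but not yet completed.
        "pending": emitted - signaled,
    }


print(parse_ring_timeout(LINE))
```

For the line above the difference between the two sequence numbers is 2.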