Running a Polaris Radeon Pro Duo on Ubuntu 18.04 with the kernel from the ROCm branch (4.17.0-rc2-180424-fkxamd), everything works great. When testing the latest drm-next-4.19-wip kernel, I get the following errors on boot and have no working OpenCL (i.e. clinfo hangs indefinitely):

[ 58.913281] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=573, emitted seq=574
[ 58.913284] [drm] GPU recovery disabled.
[ 58.914276] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=331, emitted seq=333
[ 58.914280] [drm] GPU recovery disabled.
[ 58.914312] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=331, emitted seq=333
[ 58.914313] [drm] GPU recovery disabled.

Please let me know if you need any specifics.
Please attach your full dmesg output. Are you attempting to use ROCm over the drm-next-4.19-wip kernel?
Created attachment 140696: dmesg.base.txt
Polaris Radeon Pro Duo, dmesg with no userland.
Created attachment 140697: dmesg.amdgpupro.txt
Polaris Radeon Pro Duo, dmesg with amdgpu-pro 18.20-606296.
No, I wasn't running the ROCm userland. I've been using amdgpu-pro-18.20-606296 for several weeks with the fkxamd kernel, as recommended by Felix. When I remove all userland, I don't see the ring sdma0 timeout. Without any userland the driver initializes, but with a couple of warnings.
So I stripped this down, and the ring error pops up after I've applied a new pp_table and started utilizing the GPU. My guess is that the following error has something to do with why this doesn't work in drm-next-4.19-wip:

[ 2.635258] amdgpu: [powerplay] Failed to retrieve minimum clocks.
[ 2.635259] amdgpu: [powerplay] Error in phm_get_clock_info

That error is not present when I load the 4.17.0-rc2-180424-fkxamd kernel. When I apply the same pp_table file while running 4.17.0-rc2-180424-fkxamd, it works as expected. So there is something funky in the powerplay code of drm-next-4.19-wip.
amd-staging-drm-next (built Oct 7 2018):

[ 61.701281] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=888, emitted seq=890
[ 61.701285] [drm] GPU recovery disabled.
[ 61.701397] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=902, emitted seq=904
[ 61.701399] [drm] GPU recovery disabled.

drm-next-4.20-wip (built Oct 8 2018):

[ 60.840847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=914, emitted seq=916
[ 60.840851] [drm] GPU recovery disabled.
[ 60.840962] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=907, emitted seq=909
[ 60.840964] [drm] GPU recovery disabled.

Both of these kernels work fine on my Vega 56 and Vega 64 cards; only the Pro Duo has the ring timeouts. Tested with amdgpu-pro 18.20 and 18.30, with nothing utilizing the GPUs besides the on-boot initialization.
All Polaris cards are experiencing ring errors on mainline kernels; it's not just the Polaris Pro Duo.

# lspci | grep VGA
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev cf)
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)

# uname -a
Linux localhost 4.19.0-999-lowlatency #201810092201 SMP PREEMPT Wed Oct 10 02:12:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

# dmesg | grep amdgpu
[ 8.125848] amdgpu: [powerplay] Failed to retrieve minimum clocks.
[ 8.125849] amdgpu: [powerplay] Error in phm_get_clock_info
[ 8.260967] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:09:00.0 on minor 4
[ 70.238071] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=597, emitted seq=599
[ 70.238198] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=597, emitted seq=599
etc.
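For anyone comparing logs across kernels, here is a minimal sketch of a script that pulls the ring timeout lines out of a dmesg dump and summarizes them per ring. It is not part of the driver or of any AMD tool; the function name parse_ring_timeouts and the "pending" field are made up for illustration, and the regular expression only assumes the message format seen in the logs quoted in this report. The "pending" value is simply emitted seq minus signaled seq, which I read as the number of fences still outstanding when the timeout fired.

```python
import re
from collections import Counter

# Matches amdgpu job-timeout lines in the format quoted in this report.
# Assumption: other kernel versions may word the message differently.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(dmesg_text):
    """Return one dict per ring-timeout line found in dmesg_text (hypothetical helper)."""
    events = []
    for m in TIMEOUT_RE.finditer(dmesg_text):
        signaled = int(m.group("signaled"))
        emitted = int(m.group("emitted"))
        events.append({
            "time": float(m.group("ts")),
            "ring": m.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Fences emitted but not yet signaled at the time of the timeout.
            "pending": emitted - signaled,
        })
    return events


# Sample taken from the first comment of this report.
SAMPLE = """\
[ 58.913281] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=573, emitted seq=574
[ 58.913284] [drm] GPU recovery disabled.
[ 58.914276] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=331, emitted seq=333
"""

if __name__ == "__main__":
    events = parse_ring_timeouts(SAMPLE)
    for event in events:
        print(event)
    # Count how many timeouts were reported per ring.
    print(Counter(event["ring"] for event in events))
```

To run it against a real log, replace SAMPLE with the contents of a saved dmesg file (for example one of the attachments above) read via open(path).read().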
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/16.