Bug 107266

Summary: Radeon Pro Duo (Polaris) - ring sdma0 timeout
Product: DRI
Reporter: robert <rhlug>
Component: DRM/AMDgpu-pro
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: normal    
Priority: medium    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description            Flags
dmesg.base.txt         none
dmesg.amdgpupro.txt    none

Description robert 2018-07-17 20:39:44 UTC
Running a Polaris Radeon Pro Duo on Ubuntu 18.04 with the 4.17.0-rc2-180424-fkxamd kernel from the ROCm branch, everything works great.

When testing the latest drm-next-4.19-wip kernel, I get the following errors on boot and have no working OpenCL (i.e. clinfo hangs indefinitely):


[   58.913281] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=573, emitted seq=574
[   58.913284] [drm] GPU recovery disabled.
[   58.914276] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=331, emitted seq=333
[   58.914280] [drm] GPU recovery disabled.
[   58.914312] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=331, emitted seq=333
[   58.914313] [drm] GPU recovery disabled.



Please let me know if you need any specifics.
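
For triage, the timeout lines quoted above can be reduced to a ring name and the gap between the emitted and signaled sequence numbers. The following is a minimal Python sketch, not part of the original report: the function name `parse_timeouts` and the `pending` field are hypothetical, and reading the gap as work that was submitted but never completed is an assumption about what the driver prints.

```python
import re

# Pattern for the amdgpu job-timeout lines quoted in this report.
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)


def parse_timeouts(dmesg_text):
    """Return one dict per timeout line: timestamp, ring, and sequence gap."""
    events = []
    for match in TIMEOUT_RE.finditer(dmesg_text):
        signaled = int(match["signaled"])
        emitted = int(match["emitted"])
        events.append({
            "timestamp": float(match["ts"]),
            "ring": match["ring"],
            "signaled": signaled,
            "emitted": emitted,
            # Assumption: emitted minus signaled = jobs still outstanding.
            "pending": emitted - signaled,
        })
    return events


# Sample input copied from the description of this bug.
SAMPLE = """\
[   58.913281] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=573, emitted seq=574
[   58.913284] [drm] GPU recovery disabled.
[   58.914276] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=331, emitted seq=333
"""

if __name__ == "__main__":
    for event in parse_timeouts(SAMPLE):
        print(event["ring"], event["pending"])
```

Feeding a full dmesg capture to `parse_timeouts` instead of `SAMPLE` would show at a glance which rings are stuck and whether the gap grows over time.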
Comment 1 Alex Deucher 2018-07-17 20:49:54 UTC
Please attach your full dmesg output.  Are you attempting to use ROCm over the drm-next-4.19-wip kernel?
Comment 2 robert 2018-07-18 14:09:01 UTC
Created attachment 140696 [details]
dmesg.base.txt

Polaris Radeon Pro Duo, dmesg with no userland.
Comment 3 robert 2018-07-18 14:09:49 UTC
Created attachment 140697 [details]
dmesg.amdgpupro.txt

Polaris Radeon Pro Duo, dmesg with amdgpu-pro 18.20-606296.
Comment 4 robert 2018-07-18 14:10:21 UTC
No, I wasn't running the ROCm userland.

I've been using amdgpu-pro-18.20-606296 for several weeks with the fkxamd kernel, as recommended by Felix.

When I remove all userland, I don't see the ring sdma0 timeout. Without any userland, the driver initializes, but with a couple of warnings.
Comment 5 robert 2018-07-18 14:28:05 UTC
So I stripped this down, and the ring error pops up after I've applied a new pp_table and started utilizing the GPU.

My guess is that the following error has something to do with why this doesn't work in drm-next-4.19-wip:

[    2.635258] amdgpu: [powerplay] Failed to retrieve minimum clocks.
[    2.635259] amdgpu: [powerplay] Error in phm_get_clock_info 

That error is not present when I load the 4.17.0-rc2-180424-fkxamd kernel.

I apply the same pp_table file while running 4.17.0-rc2-180424-fkxamd, and it works as expected.

So there is something funky in the powerplay code of drm-next-4.19-wip.
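
One way to check for messages that appear under one kernel but not the other is to strip the timestamps from two saved dmesg captures and compare the remaining lines. A rough Python sketch under that assumption follows. The helper names `normalize` and `only_in_second` are hypothetical, and the sample strings are assembled from lines quoted in this report (with made-up timestamps where noted), not taken from the actual attachments.

```python
import re

# Leading "[   12.345678] " kernel timestamp, which differs between boots.
TIMESTAMP_RE = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def normalize(dmesg_text):
    """Strip timestamps and surrounding whitespace; drop empty lines."""
    lines = []
    for raw in dmesg_text.splitlines():
        line = TIMESTAMP_RE.sub("", raw).strip()
        if line:
            lines.append(line)
    return lines


def only_in_second(good_text, bad_text, keyword="amdgpu"):
    """Messages containing `keyword` that occur in bad_text but not good_text."""
    seen = set(normalize(good_text))
    result = []
    for line in normalize(bad_text):
        if keyword in line and line not in seen and line not in result:
            result.append(line)
    return result


# Illustrative input only; the timestamps below are made up.
GOOD_SAMPLE = """\
[    7.900000] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:09:00.0 on minor 4
"""

BAD_SAMPLE = """\
[    8.125848] amdgpu: [powerplay] Failed to retrieve minimum clocks.
[    8.125849] amdgpu: [powerplay] Error in phm_get_clock_info
[    8.260967] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:09:00.0 on minor 4
"""

if __name__ == "__main__":
    for message in only_in_second(GOOD_SAMPLE, BAD_SAMPLE):
        print(message)
```

An exact-match comparison like this is deliberately crude: messages that embed addresses or counters will differ between boots and show up as false positives, so the output is a starting point for reading the logs, not a diagnosis.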
Comment 6 dallase 2018-10-08 15:34:33 UTC

amd-staging-drm-next (built Oct 7 2018)

[   61.701281] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=888, emitted seq=890
[   61.701285] [drm] GPU recovery disabled.
[   61.701397] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=902, emitted seq=904
[   61.701399] [drm] GPU recovery disabled.

drm-next-4.20-wip (built Oct 8 2018)

[   60.840847] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=914, emitted seq=916
[   60.840851] [drm] GPU recovery disabled.
[   60.840962] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=907, emitted seq=909
[   60.840964] [drm] GPU recovery disabled.


Both of these kernels work fine on my Vega 56 and Vega 64 cards; only the Pro Duo has the ring timeouts. This was tested with amdgpu-pro 18.20 and 18.30, with nothing utilizing the GPUs besides the on-boot initialization.
Comment 7 robert 2018-10-14 19:29:47 UTC
All Polaris cards are experiencing ring errors on mainline kernels; it's not just the Polaris Pro Duo.


# lspci | grep VGA
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev cf)
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)

# uname -a
Linux localhost 4.19.0-999-lowlatency #201810092201 SMP PREEMPT Wed Oct 10 02:12:06 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux


# dmesg | grep amdgpu
[    8.125848] amdgpu: [powerplay] Failed to retrieve minimum clocks.
[    8.125849] amdgpu: [powerplay] Error in phm_get_clock_info 
[    8.260967] [drm] Initialized amdgpu 3.27.0 20150101 for 0000:09:00.0 on minor 4
[   70.238071] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=597, emitted seq=599
[   70.238198] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=597, emitted seq=599

etc etc
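
When comparing affected machines, it can help to summarize which card revisions are installed. The sketch below tallies the `(rev ..)` field from the `lspci | grep VGA` listing in this comment. It is only an illustration: `tally_revisions` is a hypothetical helper, and the pattern assumes the exact line layout shown above.

```python
import re
from collections import Counter

# Matches lines such as:
# 01:00.0 VGA compatible controller: ... Ellesmere [Radeon RX 470/480] (rev ef)
LSPCI_RE = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]) VGA compatible controller: "
    r"(?P<name>.+?) \(rev (?P<rev>[0-9a-f]{2})\)$"
)


def tally_revisions(lspci_text):
    """Count VGA controllers per PCI revision ID."""
    counts = Counter()
    for line in lspci_text.splitlines():
        match = LSPCI_RE.match(line.strip())
        if match:
            counts[match["rev"]] += 1
    return counts


# Sample input copied from the lspci output in this comment.
SAMPLE = """\
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev cf)
09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev ef)
"""

if __name__ == "__main__":
    for revision, count in sorted(tally_revisions(SAMPLE).items()):
        print(revision, count)
```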
Comment 8 Martin Peres 2019-11-19 07:58:47 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/16
