Bug 111551

Summary: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Product: DRI Reporter: yanhua <78666679>
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED INVALID QA Contact:
Severity: major    
Priority: not set CC: 78666679, christian.koenig
Version: XOrg git   
Hardware: ARM   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
  Description: dmesg output; Flags: none
  Description: The previous dmesg.txt had messages overwritten; dmesg-full.txt shows more information; Flags: none

Description yanhua 2019-09-03 13:40:26 UTC
The amdgpu (Polaris10, WX5100) DRM driver sometimes reports:

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865

and many threads then enter the disk sleep (uninterruptible, D) state.
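
For readers triaging similar logs, here is a minimal sketch of how the message above can be parsed. It assumes that "signaled seq" is the last fence the ring completed and "emitted seq" is the last fence submitted to it, so the difference is the number of jobs still outstanding. The function name `parse_ring_timeout` is made up for this illustration, and sequence-number wraparound is ignored.

```python
import re

# Matches the amdgpu_job_timedout message in the form quoted in this report.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # Assumption: pending = fences emitted but not yet signaled (no wraparound handling).
    return m.group("ring"), signaled, emitted, emitted - signaled

line = ("[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, "
        "signaled seq=24423862, emitted seq=24423865")
print(parse_ring_timeout(line))  # ('sdma1', 24423862, 24423865, 3)
```

Under that assumption, the reported line corresponds to three jobs that were submitted to sdma1 but never completed.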


kernel version: 4.19.36

mesa: 18.3.6
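
The "disk sleep" state mentioned above is reported by Linux as state `D` in `/proc/<pid>/stat`. The following is a rough sketch of how the affected processes could be listed; the helper names and the sample stat line are hypothetical and are not taken from the attached logs.

```python
import os

def parse_proc_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state

def list_disk_sleep(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state D."""
    found = []
    for name in os.listdir(proc_root):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, name, "stat")) as f:
                pid, comm, state = parse_proc_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited or entry was unreadable
        if state == "D":
            found.append((pid, comm))
    return found

# Hypothetical sample line, for illustration only.
sample = "1234 (my (odd) proc) D 1 1234 1234 0 -1 4194560"
print(parse_proc_stat(sample))  # (1234, 'my (odd) proc', 'D')
```

This only lists processes; per-thread state would need the same parsing applied to `/proc/<pid>/task/<tid>/stat`.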
Comment 1 yanhua 2019-09-03 13:42:45 UTC
Created attachment 145253 [details]
dmesg output

Output of "grep drm" on dmesg.txt. It shows the sdma1 ring timeouts.
Comment 2 yanhua 2019-09-04 05:14:56 UTC
Created attachment 145260 [details]
The previous dmesg.txt had messages overwritten. dmesg-full.txt shows more information.
Comment 3 Christian König 2019-09-04 11:45:27 UTC
As far as I can see this is a really large box with multiple GPUs installed.

The SDMA rarely locks up, especially not while executing page table updates. So there is most likely something wrong with the hardware here.

Are you sure that the power supply is large enough for that system?

What system/platform is that? Could this be a coherency problem?
Comment 4 yanhua 2019-09-04 12:26:49 UTC
I asked the hardware team. They have tested it and are sure there is no power supply problem.


The system is arm64 with 64 cores, and there are three amdgpu cards on the board.


We see gfx, sdma, and vce timeouts, but only rarely. When a ring timeout occurs, we can use umr, the tool supplied by AMD, to read the chip registers. Can we determine the real cause from the register values?

As for the coherency problem you mentioned, I think that if it were the cause, the problem would occur more frequently. I'm not sure, though.
Comment 5 Christian König 2019-09-04 12:35:52 UTC
amdgpu was known not to work on arm64 until very recently.

So it is not a surprise that this isn't working. Please switch to a newer kernel and re-test.

Apart from that there isn't much we can do about it.
Comment 6 yanhua 2019-09-04 12:50:15 UTC
As far as I know, arm64 does not support WC (write-combined) memory, and we have already changed the WC flag in the same way the newer kernel versions do.
