Bug 111551 - [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: ARM Linux (All)
Importance: not set major
Assignee: Default DRI bug account
Reported: 2019-09-03 13:40 UTC by yanhua
Modified: 2019-09-04 12:50 UTC

dmesg output (1.01 MB, text/plain)
2019-09-03 13:42 UTC, yanhua
The previous dmesg.txt had some messages overwritten; dmesg-full.txt contains more information (445.74 KB, text/plain)
2019-09-04 05:14 UTC, yanhua

Description yanhua 2019-09-03 13:40:26 UTC
The amdgpu DRM driver (Polaris10, WX 5100) sometimes reports:

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865

and many threads end up in the uninterruptible (disk) sleep state.

kernel version: 4.19.36

mesa: 18.3.6
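
To show which threads are stuck, the tasks in "disk sleep" (state D in /proc/<pid>/stat) can be listed. The following is a minimal Python sketch, not part of the original report. It assumes a Linux /proc layout, and the names parse_stat and tasks_in_state are hypothetical helpers introduced only for illustration.

```python
import os


def parse_stat(text):
    """Split the contents of a /proc/<pid>/stat file into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    lparen = text.index("(")
    rparen = text.rindex(")")
    pid = int(text[:lparen].strip())
    comm = text[lparen + 1:rparen]
    state = text[rparen + 1:].split()[0]
    return pid, comm, state


def tasks_in_state(wanted="D", proc_root="/proc"):
    """Return a list of (tid, comm) for every thread whose state is `wanted`."""
    found = []
    if not os.path.isdir(proc_root):
        return found
    for pid in os.listdir(proc_root):
        if not pid.isdigit():
            continue
        task_dir = os.path.join(proc_root, pid, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # the process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as fh:
                    _, comm, state = parse_stat(fh.read())
            except (OSError, ValueError, IndexError):
                continue  # the thread exited or the line was unreadable
            if state == wanted:
                found.append((int(tid), comm))
    return found


for tid, comm in tasks_in_state("D"):
    print(tid, comm)
```

Run while the hang is in progress, this prints one line per blocked thread. It only identifies the stuck tasks; it does not explain why they are blocked.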
Comment 1 yanhua 2019-09-03 13:42:45 UTC
Created attachment 145253
dmesg output

Running "grep drm dmesg.txt" shows the sdma1 ring timeouts.
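
For a per-ring summary rather than raw grep output, the timeout lines can be tallied. The following is a minimal Python sketch, not part of the original report. The regular expression is modelled only on the single error line quoted in the description, so other kernel versions may word the message differently, and the names TIMEOUT_RE, parse_ring_timeouts and count_by_ring are hypothetical.

```python
import re
from collections import Counter

# Pattern assumed from the one message quoted in this report.
TIMEOUT_RE = re.compile(
    r"\*ERROR\* ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines):
    """Return one dict per ring-timeout line found in `lines`."""
    events = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match is None:
            continue
        signaled = int(match.group("signaled"))
        emitted = int(match.group("emitted"))
        events.append({
            "ring": match.group("ring"),
            "signaled": signaled,
            "emitted": emitted,
            # Fences emitted by the driver but not yet signaled.
            "pending": emitted - signaled,
        })
    return events


def count_by_ring(events):
    """Count how many timeouts were logged for each ring."""
    return Counter(event["ring"] for event in events)


sample = [
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, "
    "signaled seq=24423862, emitted seq=24423865",
]
events = parse_ring_timeouts(sample)
print(count_by_ring(events))
```

To process the attachment instead of the sample line, pass an open file object, for example parse_ring_timeouts(open("dmesg.txt", errors="replace")).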
Comment 2 yanhua 2019-09-04 05:14:56 UTC
Created attachment 145260
The previous dmesg.txt had some messages overwritten; dmesg-full.txt contains more information
Comment 3 Christian König 2019-09-04 11:45:27 UTC
As far as I can see this is a really large box with multiple GPUs installed.

The SDMA rarely locks up, especially not while executing page table updates. So there is most likely something wrong with the hardware here.

Are you sure that the power supply is large enough for that system?

What system/platform is that? Could this be a coherency problem?
Comment 4 yanhua 2019-09-04 12:26:49 UTC
I have asked the hardware team; they have tested and are sure there is no power supply problem.

The system is arm64 with 64 cores, and there are three amdgpu cards on the board.

On rare occasions we see gfx, sdma, and vce timeouts. When a ring timeout occurs, we can use umr, the tool supplied by AMD, to read the chip registers. Can we determine the real cause from the register values?

As for the coherency problem you mentioned: I think that if it were the cause, the problem would occur more frequently, but I'm not sure.
Comment 5 Christian König 2019-09-04 12:35:52 UTC
amdgpu was known not to work on arm64 until very recently.

So it is not a surprise that this isn't working. Please switch to a newer kernel and re-test.

Apart from that there isn't much we can do about it.
Comment 6 yanhua 2019-09-04 12:50:15 UTC
As far as I know, arm64 does not support WC (write-combined) memory, and we have already changed the WC flag in the same way that newer kernel versions do.
