The amdgpu DRM driver (polaris10, WX 5100) sometimes reports:
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=24423862, emitted seq=24423865
and many threads end up in the disk sleep (D) state.
Kernel version: 4.19.36
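To see which threads are stuck in disk sleep as described above, one option is to read the state field from procfs. This is only a minimal sketch, assuming a Linux system with /proc mounted; the helper names state_from_stat and tasks_in_disk_sleep are hypothetical and not part of any amdgpu tooling.

```python
import glob


def state_from_stat(stat_text):
    """Return (id, comm, state) parsed from the text of a procfs stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, _, right = stat_text.rpartition(")")
    id_text, _, comm = left.partition("(")
    return int(id_text), comm, right.split()[0]


def tasks_in_disk_sleep():
    """List (tid, comm) for every thread currently in state 'D'."""
    found = []
    # Assumption: per-thread stat files live under /proc/<pid>/task/<tid>/stat.
    for path in glob.glob("/proc/[0-9]*/task/[0-9]*/stat"):
        try:
            with open(path) as f:
                tid, comm, state = state_from_stat(f.read())
        except (OSError, ValueError, IndexError):
            # The thread exited while we were scanning, or the file was unreadable.
            continue
        if state == "D":
            found.append((tid, comm))
    return found


if __name__ == "__main__":
    for tid, comm in tasks_in_disk_sleep():
        print(tid, comm)
```

Running this while the ring timeout is being reported would show which threads are blocked; it does not by itself show what they are waiting on.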
Created attachment 145253
Output of "grep drm dmesg.txt". It shows the sdma1 ring timeouts.
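The same filtering can be taken one step further by extracting the ring name and the two sequence numbers from each timeout message. This is a rough sketch that assumes the message format matches the line quoted in this report; the function name parse_ring_timeouts is hypothetical.

```python
import re

# Assumption: timeout messages look like the one quoted in this report.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeouts(lines):
    """Return (ring, signaled, emitted, pending) for each timeout line."""
    results = []
    for line in lines:
        m = RING_TIMEOUT.search(line)
        if m is None:
            continue
        signaled = int(m.group("signaled"))
        emitted = int(m.group("emitted"))
        # pending = sequence numbers emitted but not yet signaled
        results.append((m.group("ring"), signaled, emitted, emitted - signaled))
    return results


sample = (
    "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, "
    "signaled seq=24423862, emitted seq=24423865"
)
print(parse_ring_timeouts([sample]))
```

For the line quoted in this report the difference is 24423865 - 24423862 = 3, i.e. three sequence numbers had been emitted on sdma1 but not yet signaled when the timeout fired.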
Created attachment 145260
Some messages in the previous dmesg.txt had been overwritten. dmesg-full.txt contains more information.
As far as I can see this is a really large box with multiple GPUs installed.
The SDMA rarely locks up, especially not while executing page table updates. So there is most likely something wrong with the hardware here.
Are you sure that the power supply is large enough for that system?
What system/platform is that? Could this be a coherency problem?
I asked the hardware team. They tested it and are sure there is no power supply problem.
The system is arm64 with 64 cores, and there are three amdgpu cards on the board.
We occasionally see gfx, sdma, and vce ring timeouts. When a ring timeout occurs, we can use umr, the tool supplied by AMD, to read the chip registers. Can the real cause be determined from the register values?
As for the coherency problem you mentioned, I think that if that were the cause, the problem would occur more frequently. But I'm not sure.
Until very recently, amdgpu was known not to work on arm64.
So it is not a surprise that this isn't working. Please switch to a newer kernel and re-test.
Apart from that there isn't much we can do about it.
As far as I know, arm64 does not support WC memory, and we have already changed the WC flag handling to match what newer kernel versions do.