On Intel's KBLg platform, executing the test igt@amdgpu_amd_cs_nop@sync-fork-gfx0 generated the following messages in dmesg:

[ 358.292609] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
[ 358.292635] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=145, emitted seq=145

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4733/fi-kbl-8809g/igt@amdgpu_amd_cs_nop@sync-fork-gfx0.html
This happened again:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4782/fi-kbl-8809g/igt@amdgpu/amd_cs_nop@sync-fork-gfx0.html

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137

And it spilled over into the test igt@amdgpu_amd_prime@amd-to-i915:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4782/fi-kbl-8809g/igt@amdgpu_amd_prime@amd-to-i915.html

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=147, emitted seq=147

Since we did not get any feedback, I am bumping the priority a little in the hope of attracting some attention.
(In reply to Martin Peres from comment #0)
> [ 358.292609] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> [ 358.292635] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=145, emitted seq=145

(In reply to Martin Peres from comment #1)
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=147, emitted seq=147

Hmm, signalled and emitted sequence numbers are always the same, meaning the hardware hasn't actually timed out?

I can think of two possibilities:

* A GPU scheduler bug causing the job timeout handling to be triggered spuriously. (Could something be stalling the system work queue, so the items scheduled by drm_sched_job_finish_cb can't call drm_sched_job_finish in time?)

* A problem with the handling of the GPU's interrupts. Do the numbers on the amdgpu line in /proc/interrupts still increase after these messages appear, or at least in the ten seconds before they appear?
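Something like the following quick watcher could be left running alongside the test to log the amdgpu line(s) from /proc/interrupts once a second; this is a hypothetical, untested sketch (not part of the original report), and the line matching may need adjusting for the machine in question:

/* Hypothetical helper: periodically print the amdgpu line(s) from
 * /proc/interrupts so we can tell whether the interrupt counters keep
 * moving around the time the timeout messages show up. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char line[1024];

    for (;;) {
        FILE *f = fopen("/proc/interrupts", "r");
        if (!f) {
            perror("fopen /proc/interrupts");
            return 1;
        }

        while (fgets(line, sizeof(line), f)) {
            if (strstr(line, "amdgpu"))
                printf("[%ld] %s", (long)time(NULL), line);
        }
        fclose(f);
        fflush(stdout);

        sleep(1); /* sample once per second */
    }
    return 0;
}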
We've just fixed a bug in the scheduler timeout handling, which might lead to the timeout worker being armed for the wrong job. Does this issue still occur on a recent kernel with 4823e5da2ea9061011242db81334d6ebbd2ed0a5 (drm/scheduler: fix timeout worker setup for out of order job completions) present?
(In reply to Michel Dänzer from comment #2)
> (In reply to Martin Peres from comment #0)
> > [ 358.292609] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> > [ 358.292635] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=145, emitted seq=145
>
> (In reply to Martin Peres from comment #1)
> > [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> > [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=147, emitted seq=147
>
> Hmm, signalled and emitted sequence numbers are always the same, meaning the
> hardware hasn't actually timed out?
>
> I can think of two possibilities:
>
> * A GPU scheduler bug causing the job timeout handling to be triggered
> spuriously. (Could something be stalling the system work queue, so the items
> scheduled by drm_sched_job_finish_cb can't call drm_sched_job_finish in
> time?)
>
> * A problem with the handling of the GPU's interrupts. Do the numbers on the
> amdgpu line in /proc/interrupts still increase after these messages
> appear, or at least in the ten seconds before they appear?

Here is the IGT run log:

[283/301] skip: 65, pass: 218 -  running: igt/amdgpu/amd_cs_nop/sync-fork-gfx0
[283/301] skip: 65, pass: 218 \  dmesg-warn: igt/amdgpu/amd_cs_nop/sync-fork-gfx0
[284/301] skip: 65, pass: 218, dmesg-warn: 1 \  running: igt/amdgpu/amd_prime/i915-to-amd
[284/301] skip: 65, pass: 218, dmesg-warn: 1 |  pass: igt/amdgpu/amd_prime/i915-to-amd
[285/301] skip: 65, pass: 219, dmesg-warn: 1 |  running: igt/amdgpu/amd_prime/amd-to-i915
[285/301] skip: 65, pass: 219, dmesg-warn: 1 /  dmesg-fail: igt/amdgpu/amd_prime/amd-to-i915

It shows that both test #283 and test #285 generated the timeout, yet the seqno increased by 10 between the two tests, suggesting that the GPU is not hung.

I can't easily check whether the interrupts in /proc/interrupts are still increasing on a machine that is part of our CI, but if that is what it takes to move this bug forward, I will try to get my hands on a KBLg platform and help you trigger the issue.

However, if it is a scheduler bug, I would assume this issue to be reproducible on any AMD GPU. Can you try running igt@amdgpu/amd_cs_nop@sync-fork-gfx0 in a loop for an hour or so? Your second proposal would point at a KBLg-specific bug, but let's first rule out the scheduler as being part of the problem.

In any case, thanks for your answer :)
(In reply to Lucas Stach from comment #3)
> We've just fixed a bug in the scheduler timeout handling, which might lead
> to the timeout worker being armed for the wrong job.
>
> Does this issue still occur on a recent kernel with
> 4823e5da2ea9061011242db81334d6ebbd2ed0a5 (drm/scheduler: fix timeout worker
> setup for out of order job completions) present?

I see! We are using drm-tip, which does not yet include 4823e5da2ea9061011242db81334d6ebbd2ed0a5. I assume we'll get it very soon, but given the reproduction rate of the issue (it happened twice in a week, 2% of our runs), we will not know for sure whether it is fixed for at least another month.
Since it just happened again with drm-tip 60edf94611d2, which includes 4823e5da2e, you are in luck:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4782/fi-kbl-8809g/igt@amdgpu_amd_cs_nop@sync-fork-gfx0.html
(In reply to Martin Peres from comment #4)
> However, if it is a scheduler bug, I would assume this issue to be
> reproducible on any AMD GPU. Can you try running
> igt@amdgpu/amd_cs_nop@sync-fork-gfx0 in a loop for an hour or so?

Hmm, 'fork' sounded suspicious, and indeed AFAICT this test uses the same amdgpu_context_handle (and by extension the same DRM file descriptor) in multiple processes spawned by fork(). I'm afraid Christian is going to explain why that's not supported. :)

I suppose we should at least handle this more gracefully though, if possible.
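For illustration, this is roughly the pattern I mean; a simplified, hypothetical sketch rather than the actual IGT code, with the device path only an example and error handling trimmed:

/* Sketch of the suspicious pattern: one DRM fd and one amdgpu context
 * created before fork(), then used from both processes.
 * Build against libdrm_amdgpu. */
#include <amdgpu.h>      /* libdrm_amdgpu */
#include <fcntl.h>
#include <stdint.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    uint32_t major, minor;
    amdgpu_device_handle dev;
    amdgpu_context_handle ctx;

    int fd = open("/dev/dri/renderD128", O_RDWR); /* example render node */
    if (fd < 0 || amdgpu_device_initialize(fd, &major, &minor, &dev))
        return 1;
    if (amdgpu_cs_ctx_create(dev, &ctx))
        return 1;

    if (fork() == 0) {
        /* Child: would submit NOP IBs through the *same* ctx/fd as the
         * parent (submission code omitted in this sketch). */
        _exit(0);
    }

    /* Parent: would also submit through ctx, then reaps the child. */
    wait(NULL);

    amdgpu_cs_ctx_free(ctx);
    amdgpu_device_deinitialize(dev);
    close(fd);
    return 0;
}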
BTW, any IGT patches containing amdgpu-specific code should probably be Cc'd to the amd-gfx mailing list for review.
(In reply to Michel Dänzer from comment #7)
> Hmm, 'fork' sounded suspicious, and indeed AFAICT this test uses the same
> amdgpu_context_handle (and by extension the same DRM file descriptor) in
> multiple processes spawned by fork(). I'm afraid Christian is going to
> explain why that's not supported. :)

Actually, the kernel driver can work with that. It isn't supported by Mesa.

Christian.
This used to happen every 30 runs, but it has now not been seen for 760 runs. Closing!
The CI Bug Log issue associated with this bug has been archived. New failures matching the above filters will no longer be associated with this bug.