Bug 107762 - [Intel GFX CI] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
Summary: [Intel GFX CI] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
Status: RESOLVED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: Other All
Importance: high normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-30 16:20 UTC by Martin Peres
Modified: 2019-03-08 16:27 UTC (History)
2 users

See Also:
i915 platform:
i915 features:


Description Martin Peres 2018-08-30 16:20:20 UTC
On Intel's KBLg platform, executing the test igt@amdgpu_amd_cs_nop@sync-fork-gfx0 generated the following messages in dmesg:

[  358.292609] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
[  358.292635] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=145, emitted seq=145

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4733/fi-kbl-8809g/igt@amdgpu_amd_cs_nop@sync-fork-gfx0.html
Comment 1 Martin Peres 2018-09-06 14:39:39 UTC
This happened again:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4782/fi-kbl-8809g/igt@amdgpu/amd_cs_nop@sync-fork-gfx0.html

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137

And it seems to have spilled over into the test igt@amdgpu_amd_prime@amd-to-i915:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4782/fi-kbl-8809g/igt@amdgpu_amd_prime@amd-to-i915.html

[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=147, emitted seq=147

Since we did not get any feedback, I am bumping the priority a little in the hope of attracting some attention.
Comment 2 Michel Dänzer 2018-09-06 15:16:07 UTC
(In reply to Martin Peres from comment #0)
> [  358.292609] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> [  358.292635] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=145, emitted seq=145

(In reply to Martin Peres from comment #1)
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=147, emitted seq=147

Hmm, the signaled and emitted sequence numbers are always the same, meaning the hardware hasn't actually timed out?

I can think of two possibilities:

* A GPU scheduler bug causing the job timeout handling to be triggered spuriously. (Could something be stalling the system work queue, so the items scheduled by drm_sched_job_finish_cb can't call drm_sched_job_finish in time?)

* A problem with the handling of the GPU's interrupts. Do the numbers on the amdgpu line in /proc/interrupts still increase after these messages appeared, or at least in the ten seconds before they appear?
Comment 3 Lucas Stach 2018-09-06 15:27:49 UTC
We've just fixed a bug in the scheduler timeout handling, which might lead to the timeout worker being armed for the wrong job.

Does this issue still occur on a recent kernel with 4823e5da2ea9061011242db81334d6ebbd2ed0a5 (drm/scheduler: fix timeout worker setup for out of order job completions) present?
Comment 4 Martin Peres 2018-09-06 15:30:11 UTC
(In reply to Michel Dänzer from comment #2)
> (In reply to Martin Peres from comment #0)
> > [  358.292609] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> > [  358.292635] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=145, emitted seq=145
> 
> (In reply to Martin Peres from comment #1)
> > [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=137, emitted seq=137
> > [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=147, emitted seq=147
> 
> Hmm, signalled and emitted sequence numbers are always the same, meaning the
> hardware hasn't actually timed out?
> 
> I can think of two possibilities:
> 
> * A GPU scheduler bug causing the job timeout handling to be triggered
> spuriously. (Could something be stalling the system work queue, so the items
> scheduled by drm_sched_job_finish_cb can't call drm_sched_job_finish in
> time?)
> 
> * A problem with the handling of the GPU's interrupts. Do the numbers on the
> amdgpu line in /proc/interrupts still increase after these messages
> appeared, or at least in the ten seconds before they appear?

Here is the IGT run log:

[283/301] skip: 65, pass: 218 -
running: igt/amdgpu/amd_cs_nop/sync-fork-gfx0

[283/301] skip: 65, pass: 218 \              
dmesg-warn: igt/amdgpu/amd_cs_nop/sync-fork-gfx0

[284/301] skip: 65, pass: 218, dmesg-warn: 1 \
running: igt/amdgpu/amd_prime/i915-to-amd     

[284/301] skip: 65, pass: 218, dmesg-warn: 1 |
pass: igt/amdgpu/amd_prime/i915-to-amd        

[285/301] skip: 65, pass: 219, dmesg-warn: 1 |
running: igt/amdgpu/amd_prime/amd-to-i915     

[285/301] skip: 65, pass: 219, dmesg-warn: 1 /
dmesg-fail: igt/amdgpu/amd_prime/amd-to-i915  

It shows that both tests #283 and #285 generated the timeout, yet the seqno increased by 10 between the two tests, suggesting that the GPU is not hung.

I can't easily check if interrupts in /proc/interrupts are still increasing on a machine that is part of our CI, but I guess if this is what it takes to move this bug forward, I will try to get my hands on a KBLg platform and help you trigger it.

However, if it is a scheduler bug, I would assume this issue to be reproducible on any AMD GPU. Can you try running igt@amdgpu/amd_cs_nop@sync-fork-gfx0 in a loop for an hour or so?

Your second proposal would point at a KBLg-specific bug, but let's first rule out the scheduler as being part of the problem.

In any case, thanks for your answer :)
Comment 5 Martin Peres 2018-09-06 15:33:05 UTC
(In reply to Lucas Stach from comment #3)
> We've just fixed a bug in the scheduler timeout handling, which might lead
> to the timeout worker being armed for the wrong job.
> 
> Does this issue still occur on a recent kernel with
> 4823e5da2ea9061011242db81334d6ebbd2ed0a5 (drm/scheduler: fix timeout worker
> setup for out of order job completions) present?

I see! We are using drm-tip, which does not yet include 4823e5da2ea9061011242db81334d6ebbd2ed0a5. I assume we'll get it very soon, but given the reproduction rate of the issue (it happened twice in a week, in about 2% of our runs), we will not know for sure whether the issue is fixed for at least another month.
Comment 6 Chris Wilson 2018-09-06 15:34:36 UTC
Since it just happened again with drm-tip 60edf94611d2, which includes 4823e5da2e:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4782/fi-kbl-8809g/igt@amdgpu_amd_cs_nop@sync-fork-gfx0.html
you are in luck.
Comment 7 Michel Dänzer 2018-09-06 15:51:59 UTC
(In reply to Martin Peres from comment #4)
> However, if it is scheduler bug, I would assume this issue to be
> reproducible on any AMD GPU. Can you try running
> igt@amdgpu/amd_cs_nop@sync-fork-gfx0 in a loop for an hour or so?

Hmm, 'fork' sounded suspicious, and indeed AFAICT this test uses the same amdgpu_context_handle (and by extension the same DRM file descriptor) in multiple processes spawned by fork(). I'm afraid Christian is going to explain why that's not supported. :)

I suppose we should at least handle this more gracefully though, if possible.
Comment 8 Michel Dänzer 2018-09-06 16:15:28 UTC
BTW, any IGT patches containing amdgpu specific code should probably be Cc'd to the amd-gfx mailing list for review.
Comment 9 Christian König 2018-09-06 17:54:06 UTC
(In reply to Michel Dänzer from comment #7)
> Hmm, 'fork' sounded suspicious, and indeed AFAICT this test uses the same
> amdgpu_context_handle (and by extension the same DRM file descriptor) in
> multiple processes spawned by fork(). I'm afraid Christian is going to
> explain why that's not supported. :)

Actually the kernel driver can work with that. It isn't supported by Mesa.

Christian.
Comment 10 Martin Peres 2019-03-08 15:58:06 UTC
This used to happen every 30 runs. Now not seen for 760 runs. Closing!
Comment 11 CI Bug Log 2019-03-08 16:27:15 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

