Bug 112242

Summary: amdgpu [RX Vega 56]: ring sdma0 timeout
Product: DRI
Reporter: Matthias Heinz <mh>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: major
Priority: not set
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
i915 platform:
i915 features:

Description Matthias Heinz 2019-11-11 09:33:46 UTC

I reported this at bugzilla.kernel.org but didn't get any help there, maybe because nobody expects bug reports about the amdgpu driver on the kernel's bug tracker.

This started a while ago, when I updated from 5.0.0 to a newer kernel. I'm currently on 5.3.0, and I run into this problem with almost every game I play:

Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=368056, emitted seq=368057
Aug 24 11:13:33 egalite kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR* [CRTC:47:crtc-0] flip_done timed out
Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process 7DaysToDie.x86_ pid 8108 thread 7DaysToDie:cs0
Aug 24 11:13:33 egalite kernel: amdgpu 0000:0c:00.0: GPU reset begin!
Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

Only a hard reset got the system back after that.
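
For anyone comparing logs across affected systems, here is a minimal sketch that pulls the ring name and the two sequence numbers out of an amdgpu_job_timedout line like the one above. The function name parse_ring_timeout and the regular expression are my own, written against the line format shown in this report only; other kernel versions may word the message differently. As far as I understand it, the difference between the emitted and signaled values is the number of submitted jobs on that ring that had not completed when the timeout fired.

```python
import re

# Matches the "ring <name> timeout, signaled seq=N, emitted seq=M" message
# as it appears in the log excerpt in this report (assumed format).
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending) or None if the line does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match.group("signaled"))
    emitted = int(match.group("emitted"))
    # pending = jobs emitted to the ring but not yet signaled as complete
    return match.group("ring"), signaled, emitted, emitted - signaled


if __name__ == "__main__":
    sample = (
        "Aug 24 11:13:33 egalite kernel: [drm:amdgpu_job_timedout [amdgpu]] "
        "*ERROR* ring sdma0 timeout, signaled seq=368056, emitted seq=368057"
    )
    print(parse_ring_timeout(sample))
```

In the excerpt above this gives one pending job on sdma0. The "ring gfx timeout, but soft recovered" line carries no sequence numbers, so the sketch returns None for it.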

I did some kernel traces which I will copy over to this report, if necessary, but for now you can download them here: https://bugzilla.kernel.org/show_bug.cgi?id=204683

It also looks a bit like this bug: https://bugzilla.kernel.org/show_bug.cgi?id=201957, because I also get the "ring gfx timeout". And there are lots and lots of people having this issue.

I tried bisecting it but failed. Either I missed the commit that causes this, or there are multiple causes, or this really goes way back to the time when 4.18 was the base for drm-next. That base doesn't compile with modern compilers anymore, and Steam doesn't want to run on kernels that old, so even when I was able to compile an older kernel, there was no way to test it.

I even tried debugging it over Ethernet (KGDBoE is a nice thing if you need performance), but somehow this slowed everything down enough that the bug no longer triggered.

I also tried the suggestions from https://bugs.freedesktop.org/show_bug.cgi?id=109955, but forbidding the lowest clock mode didn't help either. (It does fix my Rocket League problems, though.)

Please advise what I should try next.

Best regards
Comment 1 Martin Peres 2019-11-19 10:01:14 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/953.
