Bug 96588 - [regression] [amdgpu] Errors scheduling IBs
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
 
Reported: 2016-06-19 09:01 UTC by Mike Lothian
Modified: 2016-06-29 14:11 UTC

Attachments
journalctl output (254.10 KB, text/plain)
2016-06-19 09:03 UTC, Mike Lothian
demsg output (84.56 KB, text/plain)
2016-06-19 20:12 UTC, Mike Lothian
Possible fix (1.44 KB, patch)
2016-06-20 11:24 UTC, Christian König
dmesg (67.78 KB, text/plain)
2016-06-20 17:28 UTC, Mike Lothian
dmesg (112.73 KB, text/plain)
2016-06-23 01:22 UTC, Mike Lothian
Additional fix. (1.04 KB, patch)
2016-06-24 19:50 UTC, Christian König

Description Mike Lothian 2016-06-19 09:01:41 UTC
Hi

I've been seeing these errors in my kernel logs:

amdgpu 0000:01:00.0: couldn't schedule ib
[drm:amdgpu_job_run] *ERROR* Error scheduling IBs (-22)
[drm:amd_sched_main] *ERROR* Failed to run job!
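
For reference, the -22 in the log is a negated Linux errno value, and errno 22 is EINVAL ("Invalid argument"). Below is a minimal sketch of decoding such codes from a log line. The helper name `decode_kernel_errno` is hypothetical and not part of any driver tooling.

```python
import errno
import re


def decode_kernel_errno(line):
    """Return the symbolic errno name for a '(-N)' code in a log line, or None."""
    match = re.search(r"\((-\d+)\)", line)
    if match is None:
        return None
    # Kernel functions conventionally return negated errno values.
    code = -int(match.group(1))
    return errno.errorcode.get(code)


line = "[drm:amdgpu_job_run] *ERROR* Error scheduling IBs (-22)"
print(decode_kernel_errno(line))
```

The name alone only says that the submission was rejected as invalid; it does not say why.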

I've bisected it down to:

a7c77c7fe5f659428e73d77aa4a8ac80b638daf3 is the first bad commit
commit a7c77c7fe5f659428e73d77aa4a8ac80b638daf3
Author: Christian König <christian.koenig@amd.com>
Date:   Wed Jun 15 13:44:05 2016 +0200

    drm/amdgpu: pipeline evictions as well
    
    This boosts Xonotic from 38fps to 47fps when artificially limiting VRAM to
    256MB for testing. It should improve all CPU bound rendering situations
    where we have a lot of swapping to/from VRAM.
    
    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 2fdfd546d7175759ed5f09bdec209f71d084ab1e 6b0cdbc8f42f4e3873e3bbdcc440336856073883 M      drivers
Comment 1 Mike Lothian 2016-06-19 09:03:56 UTC
Created attachment 124602
journalctl output
Comment 2 Mike Lothian 2016-06-19 13:59:08 UTC
Reverting that commit makes the errors go away
Comment 3 Christian König 2016-06-19 20:07:05 UTC
Could be some kind of race condition. Please provide the output of "journalctl --dmesg -o short-monotonic".
Comment 4 Mike Lothian 2016-06-19 20:12:46 UTC
Created attachment 124607
demsg output
Comment 5 Christian König 2016-06-20 09:10:29 UTC
OK, clearly not a race condition, but something is wrong here, because the driver tries to initialize the ring buffers multiple times.

Maybe something is causing a GPU reset, but as far as I remember those should still be turned off by default.

Please provide a journalctl output from a boot with the patch in question reverted.
Comment 6 Mike Lothian 2016-06-20 09:33:27 UTC
I should probably have specified that this is a PRIME laptop with dynpm.

The card initialises each time it's needed: it always does this during boot, again when X starts, and each time I load a game with DRI_PRIME=1.
Comment 7 Christian König 2016-06-20 09:45:25 UTC
Ah, enlightenment! Thanks, that was the info I was missing.

We probably just forget to wait for all evictions before we turn off the GPU, so the still-running jobs produce this error message.

Give me a second to hack together a patch.
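
The hypothesis in this comment can be sketched as a toy simulation. This is not the amdgpu code or the attached patch. `ToyRing`, `ToyScheduler`, `power_down` and `simulate` are hypothetical names, and the sketch assumes that a powered-down ring rejects submissions with -EINVAL.

```python
import errno
from collections import deque


class ToyRing:
    """Hypothetical stand-in for a GPU ring; not the amdgpu data structure."""

    def __init__(self):
        self.ready = True

    def schedule_ib(self, job):
        # Assumption: a ring that is no longer ready rejects work with -EINVAL.
        if not self.ready:
            return -errno.EINVAL
        return 0


class ToyScheduler:
    """Queues jobs and records every job the ring refuses."""

    def __init__(self, ring):
        self.ring = ring
        self.queue = deque()
        self.errors = []

    def push(self, job):
        self.queue.append(job)

    def run_pending(self):
        while self.queue:
            job = self.queue.popleft()
            ret = self.ring.schedule_ib(job)
            if ret:
                self.errors.append((job, ret))


def power_down(sched, wait_for_evictions):
    if wait_for_evictions:
        # Drain queued evictions while the ring is still up.
        sched.run_pending()
    sched.ring.ready = False
    # Anything still queued now runs against a powered-down ring.
    sched.run_pending()


def simulate(wait_for_evictions, pending=3):
    sched = ToyScheduler(ToyRing())
    for i in range(pending):
        sched.push(f"evict-{i}")
    power_down(sched, wait_for_evictions)
    return sched.errors


print(simulate(wait_for_evictions=False))
print(simulate(wait_for_evictions=True))
```

In this toy model, skipping the wait makes every queued eviction fail with -22, while draining first leaves no errors. The real driver involves fences and asynchronous scheduling that the sketch leaves out.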
Comment 8 Christian König 2016-06-20 11:24:37 UTC
Created attachment 124616
Possible fix

Please test the attached patch; it should fix the issue.
Comment 9 Mike Lothian 2016-06-20 11:57:18 UTC
I'll test this tonight when I get home, thanks
Comment 10 Mike Lothian 2016-06-20 17:28:58 UTC
Created attachment 124623
dmesg

Still seems to happen
Comment 11 Christian König 2016-06-21 09:52:01 UTC
I'm running out of ideas. Does that have any other negative results except for the error message?

Alex any idea what else could cause an eviction during switching of the dGPU?
Comment 12 Alex Deucher 2016-06-21 13:38:39 UTC
(In reply to Christian König from comment #11)
> I'm running out of ideas. Does that have any other negative results except
> for the error message?
> 
> Alex any idea what else could cause an eviction during switching of the dGPU?

Powering up/down the dGPU should hit the same code as resume/suspend. Are you seeing similar issues with suspend and resume? Maybe the scheduler isn't getting stopped properly on suspend? We recently fixed something like this for GPU reset.
Comment 13 Mike Lothian 2016-06-23 01:22:16 UTC
Created attachment 124672
dmesg

It seems to spam the logs more when I fire up a game
Comment 14 Christian König 2016-06-24 19:50:08 UTC
Created attachment 124708
Additional fix.

Alex's suspend/resume idea was the right approach.

I was able to reproduce the issue and so found a pretty fundamental bug in one of my recent patches.

Please see the additional fix; together with the first patch it should resolve the issue.
Comment 15 Mike Lothian 2016-06-25 13:30:43 UTC
The patch already seems to have landed in drm-next-4.8-wip, and it does indeed seem to fix it.

