Hi

I've been seeing these errors in my kernel logs:

amdgpu 0000:01:00.0: couldn't schedule ib
[drm:amdgpu_job_run] *ERROR* Error scheduling IBs (-22)
[drm:amd_sched_main] *ERROR* Failed to run job!

I've bisected it down to:

a7c77c7fe5f659428e73d77aa4a8ac80b638daf3 is the first bad commit
commit a7c77c7fe5f659428e73d77aa4a8ac80b638daf3
Author: Christian König <christian.koenig@amd.com>
Date:   Wed Jun 15 13:44:05 2016 +0200

    drm/amdgpu: pipeline evictions as well

    This boosts Xonotic from 38fps to 47fps when artificially limiting VRAM to
    256MB for testing. It should improve all CPU bound rendering situations
    where we have a lot of swapping to/from VRAM.

    Reviewed-by: Alex Deucher <alexander.deucher@amd.com>
    Signed-off-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

:040000 040000 2fdfd546d7175759ed5f09bdec209f71d084ab1e 6b0cdbc8f42f4e3873e3bbdcc440336856073883 M drivers
Created attachment 124602 [details] journalctl output
Reverting that commit makes the errors go away
Could be some kind of race condition, please provide the output of "journalctl --dmesg -o short-monotonic".
Created attachment 124607 [details] dmesg output
Ok, clearly not a race condition, but something is wrong here because the driver tries to initialize the ring buffers multiple times. Maybe something is causing a GPU reset, but as far as I remember those should still be turned off by default. Please provide a journalctl output from a boot with the patch in question reverted.
I should probably have specified that this is a PRIME laptop with dynpm. The card initialises each time it's needed: it always does this during boot, again when X starts, and each time I load a game with DRI_PRIME=1.
Ah, enlightenment! Thanks, that was the info I was missing. We probably just forget to wait for all evictions before we turn off the GPU, so the still-running jobs produce this error message. Give me a second to hack together a patch.
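To illustrate the suspected failure mode, here is a toy model in Python. It is only a sketch and not amdgpu code: the `ToyGpu` class, its methods, and the job names are hypothetical, and the model is single-threaded. The one detail taken from the report is the error value, since -22 is -EINVAL on Linux. The idea is that jobs left in the queue when the GPU is powered down fail when they are eventually run, while draining the queue first avoids the error.

```python
import errno
from collections import deque


class ToyGpu:
    """Hypothetical stand-in for a power-managed GPU; not the amdgpu driver."""

    def __init__(self):
        self.powered = True
        self.pending = deque()  # jobs queued but not yet executed
        self.log = []

    def submit(self, name):
        self.pending.append(name)

    def run_next(self):
        """Run one queued job; fails with -EINVAL (-22) if the GPU is off."""
        name = self.pending.popleft()
        if not self.powered:
            self.log.append(f"Error scheduling {name} ({-errno.EINVAL})")
            return -errno.EINVAL
        self.log.append(f"ran {name}")
        return 0

    def power_off_unsafe(self):
        """Suspected buggy variant: cut power while jobs are still queued."""
        self.powered = False

    def power_off_drained(self):
        """Proposed variant: run everything that is queued, then cut power."""
        while self.pending:
            self.run_next()
        self.powered = False


def demo(power_off_name):
    """Queue two evictions, power off, then run whatever is left over."""
    gpu = ToyGpu()
    gpu.submit("eviction-0")
    gpu.submit("eviction-1")
    getattr(gpu, power_off_name)()
    # Anything still queued at this point runs against a powered-down device.
    leftovers = len(gpu.pending)
    results = [gpu.run_next() for _ in range(leftovers)]
    return results, gpu.log
```

Calling `demo("power_off_unsafe")` leaves both evictions queued at power-off, so both fail with -22; `demo("power_off_drained")` runs them before power-off and nothing is left to fail.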
Created attachment 124616 [details] [review] Possible fix

Please test the attached patch, it should fix the issue.
I'll test this tonight when I get home, thanks
Created attachment 124623 [details] dmesg

Still seems to happen
I'm running out of ideas. Does that have any other negative results except for the error message? Alex any idea what else could cause an eviction during switching of the dGPU?
(In reply to Christian König from comment #11)
> I'm running out of ideas. Does that have any other negative results except
> for the error message?
>
> Alex any idea what else could cause an eviction during switching of the dGPU?

Powering up/down the dGPU should hit the same code as resume/suspend. Are you seeing similar issues with suspend and resume? Maybe the scheduler isn't getting stopped properly on suspend? We recently fixed something like this for gpu reset.
Created attachment 124672 [details] dmesg

It seems to spam the logs more when I fire up a game
Created attachment 124708 [details] [review] Additional fix.

Alex's suspend/resume idea was the right approach. I was able to reproduce the issue and found a pretty fundamental bug in one of my recent patches. Please see the additional fix; together with the first patch it should resolve the issue.
The patch already seems to have landed in drm-next-4.8-wip and it does indeed seem to fix it
Marking as resolved per comment #15 and https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-4.8-wip&id=ce774d0254ed05ff6d3e3ce2c598aa4f79d45c3c