There have been some instances of the igt@kms_cursor_legacy@pipe-* tests failing with:

[drm:drm_atomic_helper_setup_commit] *ERROR* [CRTC:65:pipe B] cleanup_done timed out

Seen on Apollo Lake, Kaby Lake, and Whiskey Lake.

Example trace: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4422/shard-apl3/igt@kms_cursor_legacy@pipe-b-forked-move.html

# stdout
IGT-Version: 1.22-g19922005 (x86_64) (Linux: 4.18.0-rc3-CI-CI_DRM_4422+ x86_64)
Total updates 53012 (median of 4 processes is 13250.50)
Subtest pipe-B-forked-move: SUCCESS (21.686s)
Test requirement not met in function __real_main1358, file ../tests/kms_cursor_legacy.c:1378:
Test requirement: !(n >= display.n_pipes)

# dmesg
[ 431.073104] [drm:drm_atomic_helper_setup_commit] *ERROR* [CRTC:65:pipe B] cleanup_done timed out
Also on igt@kms_cursor_legacy@all-pipes-torture-bo
Also seen on SNB: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4542/shard-snb4/igt@kms_cursor_legacy@all-pipes-torture-bo.html

[drm:drm_atomic_helper_setup_commit] *ERROR* [CRTC:51:pipe B] cleanup_done timed out
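For triage across platforms, the error signature quoted in this report can be matched mechanically. The following is a minimal sketch, not an existing CI tool: the function name `find_cleanup_timeouts` and the regular expression are assumptions derived only from the dmesg lines shown above.

```python
import re

# Pattern inferred from the dmesg lines quoted in this report (an assumption,
# not an official format definition).
CLEANUP_TIMEOUT = re.compile(
    r"\[drm:drm_atomic_helper_setup_commit\] \*ERROR\* "
    r"\[CRTC:(?P<crtc>\d+):pipe (?P<pipe>[A-Z])\] cleanup_done timed out"
)


def find_cleanup_timeouts(dmesg_text):
    """Return a list of (crtc_id, pipe) tuples, one per matching dmesg line."""
    return [
        (int(m.group("crtc")), m.group("pipe"))
        for m in CLEANUP_TIMEOUT.finditer(dmesg_text)
    ]


# Sample input copied from the traces quoted in this report.
SAMPLE = (
    "[ 431.073104] [drm:drm_atomic_helper_setup_commit] *ERROR* "
    "[CRTC:65:pipe B] cleanup_done timed out\n"
    "[drm:drm_atomic_helper_setup_commit] *ERROR* "
    "[CRTC:51:pipe B] cleanup_done timed out\n"
)

hits = find_cleanup_timeouts(SAMPLE)
```

The timestamp prefix is deliberately not part of the pattern, since the quoted lines use different prefixes or none at all.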
commit 41db645a33e775855aeeec1a437d5c1e24ff6c88 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 12 12:57:29 2018 +0100

    drm/i915: Bump priority of clean up work

    We require that we keep the list of outstanding work short so that we do
    not "leak" memory while pageflipping under stress. However that system
    stress may delay kernel workers virtually indefinitely, which incurs the
    pageflips stall and eventually hit a timeout waiting for the cleanup.

    Try to combat CPU starvation of our short-lived cleanup workers by
    switching to a high priority workqueue.

    Testcase: igt/kms_cursor_legacy/all-pipes-torture-move
    References: https://bugs.freedesktop.org/show_bug.cgi?id=107122
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180712115729.3506-1-chris@chris-wilson.co.uk

looks like it does the trick.
(In reply to Chris Wilson from comment #3)
> commit 41db645a33e775855aeeec1a437d5c1e24ff6c88 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Thu Jul 12 12:57:29 2018 +0100
>
> drm/i915: Bump priority of clean up work
>
> We require that we keep the list of outstanding work short so that we do
> not "leak" memory while pageflipping under stress. However that system
> stress may delay kernel workers virtually indefinitely, which incurs the
> pageflips stall and eventually hit a timeout waiting for the cleanup.
>
> Try to combat CPU starvation of our short-lived cleanup workers by
> switching to a high priority workqueue.
>
> Testcase: igt/kms_cursor_legacy/all-pipes-torture-move
> References: https://bugs.freedesktop.org/show_bug.cgi?id=107122
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
> Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Link: https://patchwork.freedesktop.org/patch/msgid/20180712115729.3506-1-chris@chris-wilson.co.uk
>
> looks like it does the trick.

It indeed did the trick! Thanks!
Well, seems like there are still cases where this can happen... https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_125/fi-bwr-2160/igt@kms_cursor_legacy@all-pipes-torture-bo.html <3> [73.727461] [drm:drm_atomic_helper_setup_commit] *ERROR* [CRTC:41:pipe B] cleanup_done timed out
Update: Last seen on ICL, drmtip_95 (3 months, 3 weeks / 1995 runs ago).
The failure rate of this issue is rather low: once every 50.8 runs, across all the platforms. This was calculated by looking only at the shard results.

It looks like the issue has been fixed, as it has not been seen since CI_DRM_5670 (238 runs ago). However, the reproduction rate has fluctuated a lot throughout the history of the bug, so it seems very timing-sensitive. We'll wait until CI_DRM_6178 to verify that this is indeed fixed and not that we have just been lucky!

The test spawns $nproc x 2 children: half of them continuously update the cursor while the other half hog the CPUs. One child of each type gets pinned on each available CPU. The test has no asserts and just tries to see if anything blows up.

Based on the error, it seems that the cleanup callback is not called fast enough after a flip happens, which may lead to memory leaks. Since the reproduction rate of this issue is extremely low *even* with such a stress test, this should not have any significant user impact.

Let's see if we can reduce the occurrence rate of this issue even more, then close it as NOTABUG because Linux is not an RTOS and we cannot guarantee any timings.
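To make the "fixed or just lucky" reasoning concrete, here is a rough sketch of the odds of a clean streak if nothing had changed. It assumes independent runs with a constant failure probability of 1/50.8 per run, and that each CI_DRM build number corresponds to one run. The comment above notes that the rate has fluctuated, so treat the result as an illustration only. The helper name `p_clean_streak` is made up for this example.

```python
def p_clean_streak(mean_runs_between_failures, streak):
    """Probability of `streak` consecutive clean runs, assuming each run
    fails independently with probability 1 / mean_runs_between_failures."""
    p_fail = 1.0 / mean_runs_between_failures
    return (1.0 - p_fail) ** streak


# Figures taken from the comment above.
MEAN_RUNS_BETWEEN_FAILURES = 50.8
RUNS_SINCE_LAST_SEEN = 238
RUNS_AT_TARGET = 6178 - 5670  # assumes one run per CI_DRM build number

p_now = p_clean_streak(MEAN_RUNS_BETWEEN_FAILURES, RUNS_SINCE_LAST_SEEN)
p_target = p_clean_streak(MEAN_RUNS_BETWEEN_FAILURES, RUNS_AT_TARGET)

print(f"clean for {RUNS_SINCE_LAST_SEEN} runs by luck: {p_now:.4f}")
print(f"clean for {RUNS_AT_TARGET} runs by luck: {p_target:.1e}")
```

Under these assumptions, 238 clean runs would happen by chance a bit under 1% of the time. CI_DRM_6178 is 508 builds after CI_DRM_5670, about ten times the mean interval between failures, by which point luck becomes a far less plausible explanation.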
Note that this test is based on a real bug report (long ago) about flip workers being starved leading to further mempressure causing more system slowdown. We have to abuse the system a lot to make it even likely to happen under test conditions; but users, users have a magic all of their own.
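The stress pattern discussed in this thread (per CPU, one child updating the cursor and one child hogging the CPU) can be sketched as follows. This is not the IGT test itself, which is written in C and issues real cursor updates. The names `plan_children`, `child_loop` and `run_stress` are hypothetical, the cursor update is replaced by a short sleep, and the code is Linux-only because it relies on `os.fork` and `os.sched_setaffinity`.

```python
import os
import time


def plan_children(cpus):
    """One cursor updater and one CPU hog per CPU, i.e. 2 x nproc children."""
    return [(cpu, role) for cpu in cpus for role in ("cursor", "hog")]


def child_loop(cpu, role, seconds):
    """Pin this process to one CPU, then run its role until the deadline."""
    os.sched_setaffinity(0, {cpu})  # Linux-only
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        if role == "cursor":
            # Stand-in for a cursor update; the real test talks to the
            # display driver here.
            time.sleep(0.001)
        # The "hog" role simply spins, competing for the same CPU.


def run_stress(seconds=0.2):
    """Fork all children, wait for them, and return how many were spawned.
    Like the test described above, this makes no assertions of its own."""
    cpus = sorted(os.sched_getaffinity(0))
    pids = []
    for cpu, role in plan_children(cpus):
        pid = os.fork()
        if pid == 0:
            try:
                child_loop(cpu, role, seconds)
            finally:
                os._exit(0)  # never fall back into the parent's code
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
    return len(pids)
```

Calling `run_stress()` on a machine with N usable CPUs forks 2 x N short-lived children. Whether anything "blows up" has to be judged from dmesg, as with the original test.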
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/128.