Bug 107122 - [CI][DRMTIP] igt@kms_cursor_legacy@pipe - cleanup_done timed out
Summary: [CI][DRMTIP] igt@kms_cursor_legacy@pipe - cleanup_done timed out
Status: REOPENED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: low normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-07-05 09:57 UTC by Tomi Sarvela
Modified: 2019-04-11 11:41 UTC

See Also:
i915 platform: ALL
i915 features: display/Other


Description Tomi Sarvela 2018-07-05 09:57:09 UTC
There have been some instances of the igt@kms_cursor_legacy@pipe-* tests failing with:

[drm:drm_atomic_helper_setup_commit] *ERROR* [CRTC:65:pipe B] cleanup_done timed out

Seen on Apollo Lake, Kaby Lake, Whiskey Lake.

Example trace:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4422/shard-apl3/igt@kms_cursor_legacy@pipe-b-forked-move.html

# stdout
IGT-Version: 1.22-g19922005 (x86_64) (Linux: 4.18.0-rc3-CI-CI_DRM_4422+ x86_64)
Total updates 53012 (median of 4 processes is 13250.50)
Subtest pipe-B-forked-move: SUCCESS (21.686s)
Test requirement not met in function __real_main1358, file ../tests/kms_cursor_legacy.c:1378:
Test requirement: !(n >= display.n_pipes)

# dmesg
[  431.073104] [drm:drm_atomic_helper_setup_commit] *ERROR* [CRTC:65:pipe B] cleanup_done timed out
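
For triaging CI dmesg logs, the error line above can be picked out with a regular expression. This is only a minimal Python sketch and not part of IGT or the CI tooling: the function name `find_cleanup_timeouts` is hypothetical, and the pattern assumes the message format shown in this report.

```python
import re

# Assumed format, taken from the dmesg lines quoted in this report:
#   [drm:drm_atomic_helper_setup_commit] *ERROR* [CRTC:65:pipe B] cleanup_done timed out
CLEANUP_TIMEOUT_RE = re.compile(
    r"\[drm:drm_atomic_helper_setup_commit\] \*ERROR\* "
    r"\[CRTC:(\d+):pipe ([A-Z])\] cleanup_done timed out"
)


def find_cleanup_timeouts(lines):
    """Return (crtc_id, pipe) for every line reporting a cleanup_done timeout.

    Hypothetical helper for illustration only.
    """
    hits = []
    for line in lines:
        match = CLEANUP_TIMEOUT_RE.search(line)
        if match:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

The timestamp and log-level prefixes are left unmatched on purpose, since they differ between the traces in this bug.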
Comment 1 Tomi Sarvela 2018-07-09 08:25:20 UTC
Also on igt@kms_cursor_legacy@all-pipes-torture-bo
Comment 2 Martin Peres 2018-07-16 07:53:22 UTC
Also seen on SNB: https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4542/shard-snb4/igt@kms_cursor_legacy@all-pipes-torture-bo.html

[drm:drm_atomic_helper_setup_commit] *ERROR* [CRTC:51:pipe B] cleanup_done timed out
Comment 3 Chris Wilson 2018-08-13 12:59:07 UTC
commit 41db645a33e775855aeeec1a437d5c1e24ff6c88 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 12 12:57:29 2018 +0100

    drm/i915: Bump priority of clean up work
    
    We require that we keep the list of outstanding work short so that we do
    not "leak" memory while pageflipping under stress. However that system
    stress may delay kernel workers virtually indefinitely, which incurs the
    pageflips stall and eventually hit a timeout waiting for the cleanup.
    
    Try to combat CPU starvation of our short-lived cleanup workers by
    switching to a high priority workqueue.
    
    Testcase: igt/kms_cursor_legacy/all-pipes-torture-move
    References: https://bugs.freedesktop.org/show_bug.cgi?id=107122
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180712115729.3506-1-chris@chris-wilson.co.uk

looks like it does the trick.
Comment 4 Martin Peres 2018-09-05 08:09:22 UTC
(In reply to Chris Wilson from comment #3)
> commit 41db645a33e775855aeeec1a437d5c1e24ff6c88 (HEAD ->
> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Jul 12 12:57:29 2018 +0100
> 
>     drm/i915: Bump priority of clean up work
>     
>     We require that we keep the list of outstanding work short so that we do
>     not "leak" memory while pageflipping under stress. However that system
>     stress may delay kernel workers virtually indefinitely, which incurs the
>     pageflips stall and eventually hit a timeout waiting for the cleanup.
>     
>     Try to combat CPU starvation of our short-lived cleanup workers by
>     switching to a high priority workqueue.
>     
>     Testcase: igt/kms_cursor_legacy/all-pipes-torture-move
>     References: https://bugs.freedesktop.org/show_bug.cgi?id=107122
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
>     Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>     Link:
> https://patchwork.freedesktop.org/patch/msgid/20180712115729.3506-1-
> chris@chris-wilson.co.uk
> 
> looks like it does the trick.

It indeed did the trick! Thanks!
Comment 5 Martin Peres 2018-10-12 14:47:19 UTC
Well, seems like there are still cases where this can happen...

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_125/fi-bwr-2160/igt@kms_cursor_legacy@all-pipes-torture-bo.html

<3> [73.727461] [drm:drm_atomic_helper_setup_commit] *ERROR* [CRTC:41:pipe B] cleanup_done timed out
Comment 6 Lakshmi 2018-12-04 12:28:33 UTC
Update: Last seen on ICL, drmtip_95 (3 months, 3 weeks / 1995 runs ago).
Comment 7 Martin Peres 2019-04-11 11:36:25 UTC
The failure rate of this issue is rather low: once every 50.8 runs, across all platforms. This was calculated by looking only at the shard results.

It looks like the issue has been fixed, as it has not been seen since CI_DRM_5670 (238 runs ago). However, the reproduction rate has fluctuated a lot throughout the history of the bug, so it seems very timing-sensitive. We'll wait until CI_DRM_6178 to verify that this is indeed fixed and that we have not just been lucky!
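
To put a number on "just been lucky": a rough sketch, assuming every run fails independently with a constant probability of 1/50.8. That assumption is shaky given how much the rate has fluctuated, so treat the result as an order-of-magnitude estimate only. Under it, 238 clean runs in a row would happen by chance a little under 1% of the time. Note also that CI_DRM_6178 is 508 builds after CI_DRM_5670, i.e. ten times 50.8, although whether one build corresponds to one run here is an assumption. The function name is hypothetical.

```python
def prob_all_runs_pass(runs, mean_runs_between_failures):
    """Chance of seeing `runs` consecutive passes if the bug is NOT fixed.

    Assumes independent runs with a constant per-run failure probability,
    which is a simplification for a timing-sensitive issue.
    """
    per_run_failure = 1.0 / mean_runs_between_failures
    return (1.0 - per_run_failure) ** runs


# Runs without a failure so far, per this comment.
p_lucky_so_far = prob_all_runs_pass(238, 50.8)

# CI_DRM_6178 - CI_DRM_5670 = 508, treated here as 508 runs (assumption).
p_lucky_at_target = prob_all_runs_pass(6178 - 5670, 50.8)

print(f"238 clean runs by luck: {p_lucky_so_far:.4f}")
print(f"508 clean runs by luck: {p_lucky_at_target:.6f}")
```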

The test is spawning $nproc x 2 children: half of them are continuously updating the cursor while the other half is hogging the CPUs. A set of both types gets pinned on each available CPU. The test has no asserts and just tries to see if anything blows up.
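
The process layout described above can be sketched roughly as follows. This is not the IGT test code (that lives in tests/kms_cursor_legacy.c); it is a Linux-only Python illustration in which the cursor update is replaced by a plain counter increment, and all function names are hypothetical.

```python
import os
import time


def spin_until(deadline):
    """Busy-loop until the deadline; return the number of iterations."""
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # stand-in for one cursor update, or pure CPU hogging
    return iterations


def run_child(cpu, write_fd, duration_s):
    """Child body: pin to one CPU, spin, optionally report. Never returns."""
    status = 1
    try:
        os.sched_setaffinity(0, {cpu})  # Linux-only call
        count = spin_until(time.monotonic() + duration_s)
        if write_fd is not None:
            os.write(write_fd, str(count).encode())
        status = 0
    finally:
        os._exit(status)  # never fall back into the parent's code path


def run_forked_stress(duration_s=0.05):
    """Fork one updater and one hog per available CPU.

    Returns the per-updater iteration counts.
    """
    cpus = sorted(os.sched_getaffinity(0))
    pids, read_fds = [], []
    for cpu in cpus:
        # Updater: reports its count back to the parent through a pipe.
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            run_child(cpu, write_fd, duration_s)
        os.close(write_fd)
        pids.append(pid)
        read_fds.append(read_fd)
        # Hog: pinned to the same CPU, reports nothing.
        pid = os.fork()
        if pid == 0:
            run_child(cpu, None, duration_s)
        pids.append(pid)
    counts = []
    for fd in read_fds:
        with os.fdopen(fd) as pipe:
            counts.append(int(pipe.read()))
    for pid in pids:
        os.waitpid(pid, 0)
    return counts
```

Summing the returned counts and taking their median would give figures analogous to the "Total updates ... (median of 4 processes ...)" line in the stdout quoted in the description.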

Based on the error, it seems the cleanup callback is not called fast enough after a flip happens, which may lead to memory leaks. Since the reproduction rate of this issue is extremely low *even* with such a stress test, this should not have any significant user impact.
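
The mechanism can be illustrated with a small userspace analogy: one side waits, with a timeout, for a "cleanup done" signal that a delayed worker is supposed to raise. This is only a Python sketch of the pattern, not the kernel's implementation. The function name, the delays and the use of `threading.Timer` to model a starved worker are all assumptions made for illustration.

```python
import threading


def wait_for_cleanup(cleanup_delay_s, timeout_s):
    """Return True if cleanup is signalled before the timeout, else False.

    `cleanup_delay_s` models how long the cleanup worker is kept off the CPU.
    """
    cleanup_done = threading.Event()
    # The worker signals completion only once it finally gets to run.
    worker = threading.Timer(cleanup_delay_s, cleanup_done.set)
    worker.start()
    completed = cleanup_done.wait(timeout_s)  # False means "timed out"
    worker.join()  # let the late worker finish so no thread is left behind
    return completed


if not wait_for_cleanup(cleanup_delay_s=0.3, timeout_s=0.05):
    print("cleanup_done timed out")
```

When the worker is not delayed, the wait returns immediately. The timeout only fires when something else keeps the worker from running, which is what the CPU hogs in the test are there to provoke.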

Let's see if we can reduce the occurrence rate of this issue even more, then close it as NOTABUG because Linux is not an RTOS and we cannot guarantee any timings.
Comment 8 Chris Wilson 2019-04-11 11:41:34 UTC
Note that this test is based on a real bug report (long ago) about flip workers being starved leading to further mempressure causing more system slowdown. We have to abuse the system a lot to make it even likely to happen under test conditions; but users, users have a magic all of their own.

