Bug 102262

Summary: [SNB CI] multi-minute cpu stall when running kms_flip@blt-wf_vblank-vs-dpms|modeset
Product: DRI
Reporter: Daniel Vetter <daniel>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: daniel, intel-gfx-bugs, martin.peres, tomi.p.sarvela
Version: DRI git
Hardware: Other
OS: All
Whiteboard:
i915 platform: SNB
i915 features: GPU hang
Attachments:
netconsole capture right around the stall (flags: none)

Description Daniel Vetter 2017-08-16 14:11:56 UTC
This bug is just to record for posterity what we found out:

On SNB CI shards (gt1, though I managed to kill my gt2 a few times too), the system can appear to hard-hang when running the above testcases. This was tested on the following igt commit:

commit c8811338e8a7723b5e99a303361ed97c092fc270 (HEAD -> master, fdo/master)
Author: Kelvin Gardiner <kelvin.gardiner@intel.com>
Date:   Tue Jun 27 14:04:51 2017 -0700

    intel-ci: Add fast-feedback-simulation.testlist

The kernel integration manifest is roughly:

drm-intel drm-intel-fixes 781cc76e0c2469cb7ac12ba238a4ea006978e321
        drm/i915: Avoid the gpu reset vs. modeset deadlock
drm-upstream drm-fixes 46828dc77961d9286e55671c4dd3b6c9effadf1a
        Merge branch 'linux-4.13' of git://github.com/skeggsb/linux into drm-fixes
drm-intel drm-intel-next-fixes 04941829b0049d2446c7042ab9686dd057d809a6
        drm/i915: Hold RPM wakelock while initializing OA buffer
drm-intel drm-intel-next-queued 4e34935fcf691b2f553fdc34502d649bf979a06f
        drm/i915/cnl: Setup PAT Index.
drm-upstream drm-next 0c697fafc66830ca7d5dc19123a1d0641deaa1f6
        Backmerge tag 'v4.13-rc5' into drm-next
sound-upstream for-next c9480d055e306a855f8a8d2b3b097773cd0d5ad0
        sound: emu8000: constify emu8000_ops
sound-upstream for-linus a8e800fe0f68bc28ce309914f47e432742b865ed
        ALSA: usb-audio: Apply sample rate quirk to Sennheiser headset
drm-intel topic/core-for-CI 01cbe29aa8f8d7ffca23cf6e147a17529fae680e
        e1000e: fix buffer overrun while the I219 is processing DMA transactions
drm-misc drm-misc-next b9c55b6e2cc4369b0688961fa5de0e057f3ec0c4
        drm/vc4: Continue the switch to drm_*_put() helpers
drm-misc drm-misc-next-fixes 1ed134e6526b1b513a14fba938f6d96aa1c7f3dd
        drm/vc4: Fix VBLANK handling in crtc->enable() path
drm-misc drm-misc-fixes a0ffc51e20e90e0c1c2491de2b4b03f48b6caaba
        drm/atomic: If the atomic check fails, return its value first


I'll attach a netconsole log of a typical death, but the tl;dr is that we stall for a few minutes (with not even the NMI watchdog able to do anything) until the system eventually recovers, the batch completes, and the dpms/modeset-off goes through.
Comment 1 Daniel Vetter 2017-08-16 14:14:14 UTC
Created attachment 133552
netconsole capture right around the stall

Includes the following debug patch applied on top:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index decf5da63950..15582af42be7 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -10485,6 +10485,8 @@ static int intel_crtc_atomic_check(struct drm_crtc *crtc,
                        return ret;
        }
 
+       printk("after clock compute\n");
+
        if (crtc_state->color_mgmt_changed) {
                ret = intel_color_check(crtc, crtc_state);
                if (ret)
@@ -12025,6 +12027,8 @@ static int intel_atomic_check(struct drm_device *dev,
        if (ret)
                return ret;
 
+       printk("after check_modeset\n");
+
        for_each_oldnew_crtc_in_state(state, crtc, old_crtc_state, crtc_state, i) {
                struct intel_crtc_state *pipe_config =
                        to_intel_crtc_state(crtc_state);
@@ -12089,7 +12093,11 @@ static int intel_atomic_check(struct drm_device *dev,
                return ret;
 
        intel_fbc_choose_crtc(dev_priv, state);
-       return calc_watermark_data(state);
+       ret = calc_watermark_data(state);
+
+       printk("end of atomic_check\n");
+
+       return ret;
 }
 
 static int intel_atomic_prepare_commit(struct drm_device *dev,
@@ -12343,7 +12351,9 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
        unsigned crtc_vblank_mask = 0;
        int i;
 
+       printk("before wait\n");
        intel_atomic_commit_fence_wait(intel_state);
+       printk("after wait\n");
 
        drm_atomic_helper_wait_for_dependencies(state);
 
@@ -12573,6 +12583,8 @@ static int intel_atomic_commit(struct drm_device *dev,
                return ret;
        }
 
+       printk("after atomic prepare commit\n");
+
        /*
         * The intel_legacy_cursor_update() fast path takes care
         * of avoiding the vblank waits for simple cursor
Comment 2 Daniel Vetter 2017-08-16 14:34:14 UTC
commit f978cc027cd02a6c43b54b69fab2b538bbe05330 (HEAD -> master, fdo/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Aug 16 14:39:15 2017 +0100

    lib/dummyload: Pad with a few nops so that we do not completely hog the system


Fingers crossed.
Comment 3 Elizabeth 2018-02-13 16:50:04 UTC
Hello Daniel, has this been verified? Can we close this bug? Thanks.
Comment 4 Jani Saarinen 2018-04-20 11:11:50 UTC
Closing; please re-open if this still occurs.
