101852 – [BAT][GDG] igt@drv_hangman@error-state-basic incomplete in CI

Bug 101852 - [BAT][GDG] igt@drv_hangman@error-state-basic incomplete in CI

Summary: [BAT][GDG] igt@drv_hangman@error-state-basic incomplete in CI

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	DRI git
Hardware:	Other All

Importance:	high critical
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:	https://patchwork.freedesktop.org/ser...
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2017-07-20 11:39 UTC by Martin Peres
Modified:	2017-09-06 13:47 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:	I915G, I915GM
i915 features:	GPU hang

Attachments
Dmesg from gdg-551 igt@drv_hangman@error-state-basic (86.09 KB, text/plain) 2017-07-20 14:47 UTC, Tomi Sarvela	no flags	Details
View All

Description Martin Peres 2017-07-20 11:39:19 UTC

On the kernel CI_DRM_2852, the machine fi-gdg-551 hard hanged while running the test igt@drv_hangman@error-state-basicigt@drv_hangman@error-state-basic.

No suspicious logs available...

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2852/fi-gdg-551/igt@drv_hangman@error-state-basic.html

Comment 1 Tomi Sarvela 2017-07-20 14:47:38 UTC

Created attachment 132789 [details]
Dmesg from gdg-551 igt@drv_hangman@error-state-basic

Comment 2 Tomi Sarvela 2017-07-21 10:38:25 UTC

This log was captured with i915.reset=0. Without it, host rebooted instantly when running the test.

Comment 3 Chris Wilson 2017-07-28 14:10:33 UTC

Similar death for me on a i915gm (eeepc 700). Worse, it used to work. v3.19 can reset without death.

Comment 4 Chris Wilson 2017-07-30 13:46:58 UTC

First foray down the rabbit hole takes a wrong turn:

e2c8b8701e2d0cb5b89fa3b5c8acae9dc4f76259 is the first bad commit
commit e2c8b8701e2d0cb5b89fa3b5c8acae9dc4f76259
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date:   Tue Feb 16 10:06:14 2016 +0100

    drm/i915: Use atomic helpers for suspend, v2.

That commit inserts a deadlock early into the reset.

Last known working reset: 1db6e2e7dc2739181bd55c3c41263634803b3cc9 [v4.6]

Now for bisect attempt #2 trying to ignore the deadlock.

Comment 5 Chris Wilson 2017-08-04 11:54:27 UTC

And the not completely unsurprising result from the other side, is...

... drum roll ...

commit dfa2997055659b4e706a85fba481050cc7e7ad82
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date:   Fri Aug 5 23:28:27 2016 +0300

    drm/i915: Fix modeset handling during gpu reset, v5.
    
    This function would call drm_modeset_lock_all, while the suspend/resume
    functions already have their own locking. Fix this by factoring out
    __intel_display_resume, and calling the atomic helpers for duplicating
    atomic state and disabling all crtc's during suspend.
    
    Changes since v1:
    - Deal with -EDEADLK right after lock_all and clean up calls
      to hw readout.
    - Always take all modeset locks so updates during gpu reset are blocked.
    Changes since v2:
    - Fix deadlock in intel_update_primary_planes.
    - Move WARN_ON(EDEADLK) to __intel_display_resume.
    - pctx -> ctx
    - only call __intel_display_resume on success in intel_display_resume.
    Changes since v3:
    - Rebase on top of dev_priv -> dev change.
    - Use drm_modeset_lock_all_ctx instead of drm_modeset_lock_all.
    Changes since v4 [by vsyrjala]:
    - Deal with skip_intermediate_wm
    - Update comment w.r.t. mode_config.mutex vs. ->detect()
    - Rebase due to INTEL_GEN() etc.
    
    Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
    Fixes: e2c8b8701e2d ("drm/i915: Use atomic helpers for suspend, v2.")
    Cc: stable@vger.kernel.org
    Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/1470428910-12125-2-git-send-email-ville.syrjala@linux.intel.com
    (cherry picked from commit 739748939974791b84629a8790527a16f76873a4)
    Signed-off-by: Jani Nikula <jani.nikula@intel.com>


But both are really a single patch implementing display reset using atomic.

Comment 6 Chris Wilson 2017-08-04 12:37:34 UTC

It boils down to the machine hang occuring during the drm_atomic_commit() of the saved state.

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index fbb8574..49a98b6 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3129,7 +3129,8 @@ __intel_display_resume(struct drm_device *dev,
        /* ignore any reset values/BIOS leftovers in the WM registers */
        to_intel_atomic_state(state)->skip_intermediate_wm = true;
 
-       ret = drm_atomic_commit(state);
+       ret = 0;
+       //ret = drm_atomic_commit(state);
 
        WARN_ON(ret == -EDEADLK);
        return ret;

Comment 7 Chris Wilson 2017-08-04 17:46:20 UTC

Reverted the display reset code back to e2c8b8701e2d0cb5b89fa3b5c8acae9dc4f76259^ on top of dfa2997055659b4e706a85fba481050cc7e7ad82, and the machine still dies during the modeset on resume. So something changed in the modesetting routines between those two point?

printk debugging is not that informative, the hard machine death feels fairly random, presumably it dies when some incomplete state becomes active.

Comment 8 Chris Wilson 2017-08-07 17:40:20 UTC

(In reply to Chris Wilson from comment #7)
> Reverted the display reset code back to
> e2c8b8701e2d0cb5b89fa3b5c8acae9dc4f76259^ on top of
> dfa2997055659b4e706a85fba481050cc7e7ad82, and the machine still dies during
> the modeset on resume. So something changed in the modesetting routines
> between those two point?

Similarly cherry-picking dfa2997055659b4e706a85fba481050cc7e7ad82 onto e2c8b8701e2d0cb5b89fa3b5c8acae9dc4f76259 works. So definitely something went wrong in modesetting in between.

Comment 9 Chris Wilson 2017-08-10 11:50:51 UTC

After a third round of bisecting with manual patching:

bd18728a3f2edee33d4d241d20f394c798fb414 is the first bad commit
commit 9bd18728a3f2edee33d4d241d20f394c798fb414
Author: Ville Syrjälä <ville.syrjala@linux.intel.com>
Date:   Fri May 13 10:10:42 2016 -0700

    drm/i915: Ignore stale wm register values on resume on ilk-bdw (v2)
    
    When we resume the watermark register may contain some BIOS leftovers,
    or just the hardware reset values. We should ignore those as the
    pipes will be off anyway, and so frobbing around with intermediate
    watermarks doesn't make much sense.
    
    In fact I think we should just throw the skip_intermediate_wm flag
    out, and instead properly sanitize the "active" watermarks to match
    the current plane and pipe states. The actual wm state readout might
    also need a bit of work. But for now, let's continue with the
    skip_intermediate_wm to keep the fix more minimal.
    
    Fixes this sort of errors on resume
    [drm:ilk_validate_pipe_wm] LP0 watermark invalid
    [drm:intel_crtc_atomic_check] No valid intermediate pipe watermarks are possible
    [drm:intel_display_resume [i915]] *ERROR* Restoring old state failed with -22
    and a boatload of subsequent modeset BAT fails on my ILK.
    
    v2:
     - Rebase; the SKL atomic WM patches that just landed changed the WM
       structure fields in intel_crtc_state slightly.  (Matt)
    
    Cc: Matt Roper <matthew.d.roper@intel.com>
    Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
    Fixes: ed4a6a7ca853 ("drm/i915: Add two-stage ILK-style watermark programming (v11)")
    Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
    Signed-off-by: Matt Roper <matthew.d.roper@intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/1463159442-20478-1-git-send-email-matthew.d.roper@intel.com
    (cherry picked from commit e3d5457c7caabb77b3f1d0b09c4a63362e9b04d2)
    [Jani: rebase on drm-next while cherry-picking]
    Signed-off-by: Jani Nikula <jani.nikula@intel.com>


Now starting to double check.

Comment 10 Chris Wilson 2017-08-10 15:48:59 UTC

Walking backwards from that bad commit is proving disappointing, fails for commits that previously passed. ARGH.

Comment 11 Chris Wilson 2017-09-04 20:08:12 UTC

Narrowed down to:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 347146f1ce7d..6ba47bc7b48e 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -12458,8 +12458,10 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state)
                }
        }
 
+#if 0
        /* Now enable the clocks, plane, pipe, and connectors that we set up. */
        dev_priv->display.update_crtcs(state, &crtc_vblank_mask);
+#endif
 
        /* FIXME: We should call drm_atomic_helper_commit_hw_done() here
         * already, but still need the state for the delayed optimization. To

Comment 12 Chris Wilson 2017-09-05 14:26:15 UTC

It dies on this write:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c
index 347146f1ce7d..ffa61d585cec 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -3337,7 +3337,7 @@ static void i9xx_update_primary_plane(struct intel_plane *primary,
                I915_WRITE_FW(PRIMCNSTALPHA(plane), 0);
        }
 
-       I915_WRITE_FW(reg, dspcntr);
+       //I915_WRITE_FW(reg, dspcntr);
 
        I915_WRITE_FW(DSPSTRIDE(plane), fb->pitches[0]);
        if (IS_HASWELL(dev_priv) || IS_BROADWELL(dev_priv)) {

Comment 13 Chris Wilson 2017-09-05 14:50:07 UTC

Nothing unusual about it either,

i9xx_update_primary_plane: plane=0, I915_WRITE_FW(459136, d9000000)

Comment 14 Chris Wilson 2017-09-05 15:04:29 UTC

So my suspicion is in the reset, the pipes are switched over and that our post-reset sanitize isn't clearing the conflict.

Comment 15 Ville Syrjala 2017-09-05 15:06:36 UTC

(In reply to Chris Wilson from comment #14)
> So my suspicion is in the reset, the pipes are switched over and that our
> post-reset sanitize isn't clearing the conflict.

Hmm. Isn't this a desktop part? No FBC on those so we shouldn't do any plane<->pipe switching. But somehow that write is trying to enable plane A on pipe B.

Comment 16 Chris Wilson 2017-09-05 15:16:29 UTC

I'm on a i915m, not quite the same hw as farm's gdg. Fingers crossed that whatever works for me, works for gdg as well.

Comment 17 Chris Wilson 2017-09-05 15:32:23 UTC

Ok, set both planes to zero, no effect. In __display_resume:

        for_each_pipe(dev_priv, pipe) {
                struct intel_crtc *crtc =
                        intel_get_crtc_for_pipe(dev_priv, pipe);
                enum plane plane = crtc->plane;

                I915_WRITE_FW(DSPCNTR(plane), 0);
                I915_WRITE_FW(DSPADDR(plane), 0);
        }

Comment 18 Ville Syrjala 2017-09-05 17:00:26 UTC

(In reply to Chris Wilson from comment #17)
> Ok, set both planes to zero, no effect. In __display_resume:
> 
>         for_each_pipe(dev_priv, pipe) {
>                 struct intel_crtc *crtc =
>                         intel_get_crtc_for_pipe(dev_priv, pipe);
>                 enum plane plane = crtc->plane;
> 
>                 I915_WRITE_FW(DSPCNTR(plane), 0);
>                 I915_WRITE_FW(DSPADDR(plane), 0);
>         }

Hmm. I wonder if gen3 still needs both DPLLs and/or pipes running when moving a plane between them. That was the case on 830, but I thought it got fixed on 855 already.

Comment 19 Chris Wilson 2017-09-05 20:50:34 UTC

https://patchwork.freedesktop.org/series/29845/

Comment 20 Chris Wilson 2017-09-06 13:21:02 UTC

commit 0db8c961209153498fe7e279b8f0d3deb81808f0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Sep 6 12:14:05 2017 +0100

    drm/i915: Re-enable GTT following a device reset
    
    Ville Syrjälä spotted that PGETBL_CTL was losing its enable bit upon a
    reset. That was causing the display to show garbage on his 945gm. On my
    i915gm the effect was far more severe; re-enabling the display following
    the reset without PGETBL_CTL being enabled lead to an immediate hard
    hang.
    
    We do have a routine to re-enable PGETBL_CTL which is applicable to
    gen2-4, although on gen4 it is documented that a graphics reset doesn't
    alter the register (no such wording is given for gen3) and should be safe
    to call to punch back in the enable bit. However, that leaves the question
    of whether we need to completely re-initialise the register and the
    rest of the GSM. For g33/pnv/gen4+, where we do have a configurable
    page table, its contents do seem to be kept, and so we should be able to
    recover without having to reinitialise the GTT from scratch (as prior to
    g33, that register is configured by the BIOS and we leave alone except
    for the enable bit).
    
    This appears to have been broken by commit 5fbd0418eef2 ("drm/i915:
    Re-enable GGTT earlier during resume on pre-gen6 platforms"), which
    moved the intel_enable_gtt() from i915_gem_init_hw() (also used by
    reset) to add it earlier during hw init and resume, missing the reset
    path.
    
    v2: Find the culprit, rearrange ggtt_enable to be before gem_init_hw to
    match init/resume
    
    Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Fixes: 5fbd0418eef2 ("drm/i915: Re-enable GGTT earlier during resume on pre-gen6 platforms")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101852
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: Daniel Vetter <daniel@ffwll.ch>
    Reviewed-by: Daniel Vetter <daniel@ffwll.ch>
    Link: https://patchwork.freedesktop.org/patch/msgid/20170906111405.27110-1-chris@chris-wilson.co.uk
    Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>

Comment 21 Martin Peres 2017-09-06 13:47:16 UTC

It indeed fixed the issue. Thanks a lot, Chris!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.