On the kernel CI_DRM_2852, the machine fi-gdg-551 hard hanged while running the test igt@drv_hangman@error-state-basicigt@drv_hangman@error-state-basic. No suspicious logs available... Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2852/fi-gdg-551/igt@drv_hangman@error-state-basic.html
Created attachment 132789 [details] Dmesg from gdg-551 igt@drv_hangman@error-state-basic
This log was captured with i915.reset=0. Without it, host rebooted instantly when running the test.
Similar death for me on a i915gm (eeepc 700). Worse, it used to work. v3.19 can reset without death.
First foray down the rabbit hole takes a wrong turn: e2c8b8701e2d0cb5b89fa3b5c8acae9dc4f76259 is the first bad commit commit e2c8b8701e2d0cb5b89fa3b5c8acae9dc4f76259 Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Date: Tue Feb 16 10:06:14 2016 +0100 drm/i915: Use atomic helpers for suspend, v2. That commit inserts a deadlock early into the reset. Last known working reset: 1db6e2e7dc2739181bd55c3c41263634803b3cc9 [v4.6] Now for bisect attempt #2 trying to ignore the deadlock.
And the not completely unsurprising result from the other side, is... ... drum roll ... commit dfa2997055659b4e706a85fba481050cc7e7ad82 Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Date: Fri Aug 5 23:28:27 2016 +0300 drm/i915: Fix modeset handling during gpu reset, v5. This function would call drm_modeset_lock_all, while the suspend/resume functions already have their own locking. Fix this by factoring out __intel_display_resume, and calling the atomic helpers for duplicating atomic state and disabling all crtc's during suspend. Changes since v1: - Deal with -EDEADLK right after lock_all and clean up calls to hw readout. - Always take all modeset locks so updates during gpu reset are blocked. Changes since v2: - Fix deadlock in intel_update_primary_planes. - Move WARN_ON(EDEADLK) to __intel_display_resume. - pctx -> ctx - only call __intel_display_resume on success in intel_display_resume. Changes since v3: - Rebase on top of dev_priv -> dev change. - Use drm_modeset_lock_all_ctx instead of drm_modeset_lock_all. Changes since v4 [by vsyrjala]: - Deal with skip_intermediate_wm - Update comment w.r.t. mode_config.mutex vs. ->detect() - Rebase due to INTEL_GEN() etc. Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Fixes: e2c8b8701e2d ("drm/i915: Use atomic helpers for suspend, v2.") Cc: stable@vger.kernel.org Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1470428910-12125-2-git-send-email-ville.syrjala@linux.intel.com (cherry picked from commit 739748939974791b84629a8790527a16f76873a4) Signed-off-by: Jani Nikula <jani.nikula@intel.com> But both are really a single patch implementing display reset using atomic.
It boils down to the machine hang occuring during the drm_atomic_commit() of the saved state. diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d index fbb8574..49a98b6 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -3129,7 +3129,8 @@ __intel_display_resume(struct drm_device *dev, /* ignore any reset values/BIOS leftovers in the WM registers */ to_intel_atomic_state(state)->skip_intermediate_wm = true; - ret = drm_atomic_commit(state); + ret = 0; + //ret = drm_atomic_commit(state); WARN_ON(ret == -EDEADLK); return ret;
Reverted the display reset code back to e2c8b8701e2d0cb5b89fa3b5c8acae9dc4f76259^ on top of dfa2997055659b4e706a85fba481050cc7e7ad82, and the machine still dies during the modeset on resume. So something changed in the modesetting routines between those two point? printk debugging is not that informative, the hard machine death feels fairly random, presumably it dies when some incomplete state becomes active.
(In reply to Chris Wilson from comment #7) > Reverted the display reset code back to > e2c8b8701e2d0cb5b89fa3b5c8acae9dc4f76259^ on top of > dfa2997055659b4e706a85fba481050cc7e7ad82, and the machine still dies during > the modeset on resume. So something changed in the modesetting routines > between those two point? Similarly cherry-picking dfa2997055659b4e706a85fba481050cc7e7ad82 onto e2c8b8701e2d0cb5b89fa3b5c8acae9dc4f76259 works. So definitely something went wrong in modesetting in between.
After a third round of bisecting with manual patching: bd18728a3f2edee33d4d241d20f394c798fb414 is the first bad commit commit 9bd18728a3f2edee33d4d241d20f394c798fb414 Author: Ville Syrjälä <ville.syrjala@linux.intel.com> Date: Fri May 13 10:10:42 2016 -0700 drm/i915: Ignore stale wm register values on resume on ilk-bdw (v2) When we resume the watermark register may contain some BIOS leftovers, or just the hardware reset values. We should ignore those as the pipes will be off anyway, and so frobbing around with intermediate watermarks doesn't make much sense. In fact I think we should just throw the skip_intermediate_wm flag out, and instead properly sanitize the "active" watermarks to match the current plane and pipe states. The actual wm state readout might also need a bit of work. But for now, let's continue with the skip_intermediate_wm to keep the fix more minimal. Fixes this sort of errors on resume [drm:ilk_validate_pipe_wm] LP0 watermark invalid [drm:intel_crtc_atomic_check] No valid intermediate pipe watermarks are possible [drm:intel_display_resume [i915]] *ERROR* Restoring old state failed with -22 and a boatload of subsequent modeset BAT fails on my ILK. v2: - Rebase; the SKL atomic WM patches that just landed changed the WM structure fields in intel_crtc_state slightly. (Matt) Cc: Matt Roper <matthew.d.roper@intel.com> Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com> Fixes: ed4a6a7ca853 ("drm/i915: Add two-stage ILK-style watermark programming (v11)") Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Reviewed-by: Matt Roper <matthew.d.roper@intel.com> Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/1463159442-20478-1-git-send-email-matthew.d.roper@intel.com (cherry picked from commit e3d5457c7caabb77b3f1d0b09c4a63362e9b04d2) [Jani: rebase on drm-next while cherry-picking] Signed-off-by: Jani Nikula <jani.nikula@intel.com> Now starting to double check.
Walking backwards from that bad commit is proving disappointing, fails for commits that previously passed. ARGH.
Narrowed down to: diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 347146f1ce7d..6ba47bc7b48e 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -12458,8 +12458,10 @@ static void intel_atomic_commit_tail(struct drm_atomic_state *state) } } +#if 0 /* Now enable the clocks, plane, pipe, and connectors that we set up. */ dev_priv->display.update_crtcs(state, &crtc_vblank_mask); +#endif /* FIXME: We should call drm_atomic_helper_commit_hw_done() here * already, but still need the state for the delayed optimization. To
It dies on this write: diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_display.c index 347146f1ce7d..ffa61d585cec 100644 --- a/drivers/gpu/drm/i915/intel_display.c +++ b/drivers/gpu/drm/i915/intel_display.c @@ -3337,7 +3337,7 @@ static void i9xx_update_primary_plane(struct intel_plane *primary, I915_WRITE_FW(PRIMCNSTALPHA(plane), 0); } - I915_WRITE_FW(reg, dspcntr); + //I915_WRITE_FW(reg, dspcntr); I915_WRITE_FW(DSPSTRIDE(plane), fb->pitches[0]); if (IS_HASWELL(dev_priv) || IS_BROADWELL(dev_priv)) {
Nothing unusual about it either, i9xx_update_primary_plane: plane=0, I915_WRITE_FW(459136, d9000000)
So my suspicion is in the reset, the pipes are switched over and that our post-reset sanitize isn't clearing the conflict.
(In reply to Chris Wilson from comment #14) > So my suspicion is in the reset, the pipes are switched over and that our > post-reset sanitize isn't clearing the conflict. Hmm. Isn't this a desktop part? No FBC on those so we shouldn't do any plane<->pipe switching. But somehow that write is trying to enable plane A on pipe B.
I'm on a i915m, not quite the same hw as farm's gdg. Fingers crossed that whatever works for me, works for gdg as well.
Ok, set both planes to zero, no effect. In __display_resume: for_each_pipe(dev_priv, pipe) { struct intel_crtc *crtc = intel_get_crtc_for_pipe(dev_priv, pipe); enum plane plane = crtc->plane; I915_WRITE_FW(DSPCNTR(plane), 0); I915_WRITE_FW(DSPADDR(plane), 0); }
(In reply to Chris Wilson from comment #17) > Ok, set both planes to zero, no effect. In __display_resume: > > for_each_pipe(dev_priv, pipe) { > struct intel_crtc *crtc = > intel_get_crtc_for_pipe(dev_priv, pipe); > enum plane plane = crtc->plane; > > I915_WRITE_FW(DSPCNTR(plane), 0); > I915_WRITE_FW(DSPADDR(plane), 0); > } Hmm. I wonder if gen3 still needs both DPLLs and/or pipes running when moving a plane between them. That was the case on 830, but I thought it got fixed on 855 already.
https://patchwork.freedesktop.org/series/29845/
commit 0db8c961209153498fe7e279b8f0d3deb81808f0 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Sep 6 12:14:05 2017 +0100 drm/i915: Re-enable GTT following a device reset Ville Syrjälä spotted that PGETBL_CTL was losing its enable bit upon a reset. That was causing the display to show garbage on his 945gm. On my i915gm the effect was far more severe; re-enabling the display following the reset without PGETBL_CTL being enabled lead to an immediate hard hang. We do have a routine to re-enable PGETBL_CTL which is applicable to gen2-4, although on gen4 it is documented that a graphics reset doesn't alter the register (no such wording is given for gen3) and should be safe to call to punch back in the enable bit. However, that leaves the question of whether we need to completely re-initialise the register and the rest of the GSM. For g33/pnv/gen4+, where we do have a configurable page table, its contents do seem to be kept, and so we should be able to recover without having to reinitialise the GTT from scratch (as prior to g33, that register is configured by the BIOS and we leave alone except for the enable bit). This appears to have been broken by commit 5fbd0418eef2 ("drm/i915: Re-enable GGTT earlier during resume on pre-gen6 platforms"), which moved the intel_enable_gtt() from i915_gem_init_hw() (also used by reset) to add it earlier during hw init and resume, missing the reset path. v2: Find the culprit, rearrange ggtt_enable to be before gem_init_hw to match init/resume Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Fixes: 5fbd0418eef2 ("drm/i915: Re-enable GGTT earlier during resume on pre-gen6 platforms") Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101852 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Daniel Vetter <daniel@ffwll.ch> Reviewed-by: Daniel Vetter <daniel@ffwll.ch> Link: https://patchwork.freedesktop.org/patch/msgid/20170906111405.27110-1-chris@chris-wilson.co.uk Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
It indeed fixed the issue. Thanks a lot, Chris!
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.