| Summary: | [BAT][BDW] WARN_ON(!intel_engines_are_idle(dev_priv)) in i915_gem_suspend+0x123/0x140 | ||
|---|---|---|---|
| Product: | DRI | Reporter: | Martin Peres <martin.peres> |
| Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| Severity: | critical | ||
| Priority: | highest | CC: | intel-gfx-bugs |
| Version: | DRI git | ||
| Hardware: | Other | ||
| OS: | All | ||
| Whiteboard: | ReadyForDev | ||
| i915 platform: | BDW | i915 features: | GEM/Other |
Description
Martin Peres
2017-07-24 08:47:15 UTC
It's just one of those impossible conditions that should never fire. The sequence is this:
	/* As the idle_work is rearming if it detects a race, play safe and
	 * repeat the flush until it is definitely idle.
	 */
	while (flush_delayed_work(&dev_priv->gt.idle_work))
		;
	/* Assert that we successfully flushed all the work and
	 * reset the GPU back to its idle, low power state.
	 */
	WARN_ON(dev_priv->gt.awake);
	WARN_ON(!intel_engines_are_idle(dev_priv));
The idle work waits for the engines to become idle and then sets gt.awake=false. The engines cannot be woken again without gt.awake first being set back to true. So either we have a race despite being in a single-threaded suspend context, or... I have no idea.
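To make the flush loop concrete, here is a minimal userspace sketch of the rearm-and-repeat pattern. Everything in it (`struct gt_model`, `idle_work`, `flush_work_model`, `flush_until_idle`) is a hypothetical stand-in, not the driver's code or the kernel workqueue API; it only assumes that a flush reports whether it ran any work and that the worker rearms itself while the engines still look busy.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-in for the GT bookkeeping; not the driver's struct. */
struct gt_model {
	bool awake;           /* plays the role of gt.awake */
	bool work_pending;    /* an idle work item is queued */
	int busy_polls_left;  /* how many more times the engines will look busy */
};

/* Models the idle worker: rearm while busy, otherwise park the GT. */
static void idle_work(struct gt_model *gt)
{
	if (gt->busy_polls_left > 0) {
		gt->busy_polls_left--;
		gt->work_pending = true;  /* still busy: rearm and try again later */
		return;
	}
	gt->awake = false;
}

/* Models a flush: run the pending work, report whether anything ran. */
static bool flush_work_model(struct gt_model *gt)
{
	if (!gt->work_pending)
		return false;
	gt->work_pending = false;
	idle_work(gt);
	return true;
}

/* Repeat the flush until nothing was pending; returns how many passes ran work. */
static int flush_until_idle(struct gt_model *gt)
{
	int passes = 0;

	while (flush_work_model(gt))
		passes++;
	return passes;
}
```

In this model the loop can only exit once a flush finds nothing queued, and the last work that ran must have cleared `awake`. That is why both asserts after the loop look impossible to trip in a single-threaded context.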
bool intel_engines_are_idle(struct drm_i915_private *dev_priv)
{
	struct intel_engine_cs *engine;
	enum intel_engine_id id;

	if (READ_ONCE(dev_priv->gt.active_requests))
		return false;

	/* If the driver is wedged, HW state may be very inconsistent and
	 * report that it is still busy, even though we have stopped using it.
	 */
	if (i915_terminally_wedged(&dev_priv->gpu_error))
		return true;

	for_each_engine(engine, dev_priv, id) {
		if (!intel_engine_is_idle(engine))
			return false;
	}

	return true;
}

bool intel_engine_is_idle(struct intel_engine_cs *engine)
{
	struct drm_i915_private *dev_priv = engine->i915;

	/* More white lies, if wedged, hw state is inconsistent */
	if (i915_terminally_wedged(&dev_priv->gpu_error))
		return true;

	/* Any inflight/incomplete requests? */
	if (!i915_seqno_passed(intel_engine_get_seqno(engine),
			       intel_engine_last_submit(engine)))
		return false;

	if (I915_SELFTEST_ONLY(engine->breadcrumbs.mock))
		return true;

	/* Interrupt/tasklet pending? */
	if (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
		return false;

	/* Both ports drained, no more ELSP submission? */
	if (port_request(&engine->execlist_port[0]))
		return false;

	/* ELSP is empty, but there are ready requests? */
	if (READ_ONCE(engine->execlist_first))
		return false;

	/* Ring stopped? */
	if (!ring_is_idle(engine))
		return false;

	return true;
}
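For reference, the "inflight requests" test relies on a wraparound-safe sequence number comparison. The sketch below assumes the usual signed-difference idiom for 32-bit seqnos; `seqno_passed` and `no_inflight_requests` are hypothetical userspace names for illustration, not the driver's helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if seq1 is at or after seq2, tolerating u32 wraparound. */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
	/* The unsigned subtraction wraps; the signed view gives the direction. */
	return (int32_t)(seq1 - seq2) >= 0;
}

/* Nothing is inflight once the completed seqno has reached the last one submitted. */
static bool no_inflight_requests(uint32_t completed, uint32_t last_submitted)
{
	return seqno_passed(completed, last_submitted);
}
```

With this idiom a completed seqno of 2 still counts as "after" a submitted seqno of 0xfffffffe, so a counter wrap on its own should not make an engine look busy.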
It might be possible for an interrupt to kick in and dirty irq_posted, a very late active->idle notification. Or the ring_is_idle() check on RING_MODE may be garbage.
I'm going to go back and play the waiting game. Note to future self: consider adding a WARN_ON(test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted));
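One way to narrow down which condition trips is to report the first failing check rather than a bare boolean. The following is a userspace sketch only: `struct engine_snapshot`, `enum busy_reason` and `first_busy_reason` are hypothetical names, and the flags are assumed to have been sampled from the same conditions, in the same order, as the function quoted above.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the conditions the idle check tests. */
struct engine_snapshot {
	bool requests_inflight;  /* seqno has not reached the last submit */
	bool irq_posted;         /* ENGINE_IRQ_EXECLIST still set */
	bool port_busy;          /* execlist_port[0] still holds a request */
	bool queue_nonempty;     /* execlist_first is set */
	bool ring_busy;          /* ring_is_idle() returned false */
};

enum busy_reason {
	BUSY_NONE,
	BUSY_INFLIGHT,
	BUSY_IRQ_POSTED,
	BUSY_PORT,
	BUSY_QUEUE,
	BUSY_RING,
};

/* Return the first failing check, in the same order as the quoted function. */
static enum busy_reason first_busy_reason(const struct engine_snapshot *s)
{
	if (s->requests_inflight)
		return BUSY_INFLIGHT;
	if (s->irq_posted)
		return BUSY_IRQ_POSTED;
	if (s->port_busy)
		return BUSY_PORT;
	if (s->queue_nonempty)
		return BUSY_QUEUE;
	if (s->ring_busy)
		return BUSY_RING;
	return BUSY_NONE;
}
```

Logging the returned reason next to the warning would distinguish a late interrupt dirtying irq_posted from a bogus ring state without changing the order of the checks.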
To allow the machine to recover after we encounter this mystery:

commit fc692bd31bc9dd17c7cc59abdb514a58964fc2a7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sat Aug 26 12:09:35 2017 +0100

    drm/i915: Discard the request queue if we fail to sleep before suspend

    If we fail to clear the outstanding request queue before suspending,
    mark those requests as lost.
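The idea in that commit message can be illustrated with a small userspace model. This is not the patch itself: `struct request_model`, `enum req_state` and `discard_pending` are hypothetical, and the choice of -EIO as the error recorded for a lost request is an assumption made for the example.

```c
#include <errno.h>
#include <stddef.h>

enum req_state { REQ_PENDING, REQ_DONE, REQ_LOST };

/* Hypothetical request record; the real driver tracks far more state. */
struct request_model {
	enum req_state state;
	int error;  /* 0 on success, negative errno once the request is lost */
};

/* Mark every still-pending request as lost; return how many were discarded. */
static size_t discard_pending(struct request_model *reqs, size_t count)
{
	size_t lost = 0;

	for (size_t i = 0; i < count; i++) {
		if (reqs[i].state != REQ_PENDING)
			continue;
		reqs[i].state = REQ_LOST;
		reqs[i].error = -EIO;  /* assumed error code for the sketch */
		lost++;
	}
	return lost;
}
```

Completed requests are left untouched, so only work that never finished is reported as lost once the suspend path gives up waiting.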