101891 – [BAT][BDW] WARN_ON(!intel_engines_are_idle(dev_priv)) in i915_gem_suspend+0x123/0x140

Bug 101891 - [BAT][BDW] WARN_ON(!intel_engines_are_idle(dev_priv)) in i915_gem_suspend+0x123/0x140

Summary: [BAT][BDW] WARN_ON(!intel_engines_are_idle(dev_priv)) in i915_gem_suspend+0x1...

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	DRI git
Hardware:	Other All

Importance:	highest critical
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2017-07-24 08:47 UTC by Martin Peres
Modified:	2017-10-03 20:54 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:	BDW
i915 features:	GEM/Other

Attachments

Description Martin Peres 2017-07-24 08:47:15 UTC

When running igt@gem_exec_suspend@basic-s3 on fi-bdw-5557u on CI_DRM_2859, we got the following warning:

[  205.129268] WARN_ON(!intel_engines_are_idle(dev_priv))
[  205.129283] ------------[ cut here ]------------
[  205.129312] WARNING: CPU: 1 PID: 3106 at drivers/gpu/drm/i915/i915_gem.c:4589 i915_gem_suspend+0x123/0x140 [i915]
[  205.129313] Modules linked in: snd_hda_codec_hdmi i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek snd_hda_codec_generic ghash_clmulni_intel e1000e snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core ptp mei_me pps_core lpc_ich mei snd_pcm prime_numbers i2c_hid
[  205.129345] CPU: 1 PID: 3106 Comm: kworker/u8:12 Tainted: G     U          4.13.0-rc1-CI-CI_DRM_2859+ #1
[  205.129346] Hardware name:                  /NUC5i7RYB, BIOS RYBDWi35.86A.0362.2017.0118.0940 01/18/2017
[  205.129350] Workqueue: events_unbound async_run_entry_fn
[  205.129353] task: ffff880247350040 task.stack: ffffc90000434000
[  205.129377] RIP: 0010:i915_gem_suspend+0x123/0x140 [i915]
[  205.129378] RSP: 0018:ffffc90000437c80 EFLAGS: 00010282
[  205.129381] RAX: 000000000000002a RBX: ffff88023e9d0000 RCX: 0000000000000006
[  205.129383] RDX: 0000000000000006 RSI: ffffffff81cf0a44 RDI: ffffffff81ca7bbe
[  205.129384] RBP: ffffc90000437c98 R08: 0000000000000000 R09: 0000000000000001
[  205.129385] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  205.129387] R13: ffff88023e9d7fb0 R14: 0000000000000000 R15: 0000000000000000
[  205.129388] FS:  0000000000000000(0000) GS:ffff880256c80000(0000) knlGS:0000000000000000
[  205.129390] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  205.129391] CR2: 000000000177e068 CR3: 0000000003e0f000 CR4: 00000000003406e0
[  205.129393] Call Trace:
[  205.129414]  i915_pm_suspend+0x80/0x190 [i915]
[  205.129419]  pci_pm_suspend+0x71/0x130
[  205.129422]  dpm_run_callback+0x6f/0x330
[  205.129425]  ? pci_pm_resume+0x90/0x90
[  205.129428]  __device_suspend+0xfb/0x370
[  205.129431]  ? dpm_watchdog_set+0x60/0x60
[  205.129435]  async_suspend+0x1a/0x90
[  205.129438]  async_run_entry_fn+0x33/0x160
[  205.129441]  process_one_work+0x1fe/0x670
[  205.129445]  worker_thread+0x49/0x3b0
[  205.129449]  kthread+0x10f/0x150
[  205.129451]  ? process_one_work+0x670/0x670
[  205.129453]  ? kthread_create_on_node+0x40/0x40
[  205.129456]  ret_from_fork+0x27/0x40
[  205.129461] Code: df e8 32 98 ff ff e9 4c ff ff ff 48 83 7a 08 00 0f 84 7a ff ff ff 0f 0b 48 c7 c6 f8 2e 2d a0 48 c7 c7 45 0a 2c a0 e8 89 0a f3 e0 <0f> ff eb c2 48 c7 c6 e6 0a 2c a0 48 c7 c7 45 0a 2c a0 e8 72 0a 
[  205.129534] ---[ end trace 0873ed3567407846 ]---

This has not been reproduced in 5 runs, so I assume it is not a common issue to hit.

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2859/fi-bdw-5557u/igt@gem_exec_suspend@basic-s3.html

Comment 1 Chris Wilson 2017-07-28 16:18:34 UTC

It's just one of those impossible conditions that should never fire. The sequence is this

        /* As the idle_work is rearming if it detects a race, play safe and
         * repeat the flush until it is definitely idle.
         */
        while (flush_delayed_work(&dev_priv->gt.idle_work))
                ;

        /* Assert that we sucessfully flushed all the work and
         * reset the GPU back to its idle, low power state.
         */
        WARN_ON(dev_priv->gt.awake);
        WARN_ON(!intel_engines_are_idle(dev_priv));

The idle work waits for idle engines and sets gt.awake=false. Then before engines can be awoken, gt.awake=true. So we either have a race despite being in a single threaded suspend context, or... I have no idea.


bool intel_engines_are_idle(struct drm_i915_private *dev_priv)
{
        struct intel_engine_cs *engine;
        enum intel_engine_id id;

        if (READ_ONCE(dev_priv->gt.active_requests))
                return false;

        /* If the driver is wedged, HW state may be very inconsistent and
         * report that it is still busy, even though we have stopped using it.
         */
        if (i915_terminally_wedged(&dev_priv->gpu_error))
                return true;

        for_each_engine(engine, dev_priv, id) {
                if (!intel_engine_is_idle(engine))
                        return false;
        }

        return true;
}


bool intel_engine_is_idle(struct intel_engine_cs *engine)
{
        struct drm_i915_private *dev_priv = engine->i915;

        /* More white lies, if wedged, hw state is inconsistent */
        if (i915_terminally_wedged(&dev_priv->gpu_error))
                return true;

        /* Any inflight/incomplete requests? */
        if (!i915_seqno_passed(intel_engine_get_seqno(engine),
                               intel_engine_last_submit(engine)))
                return false;

        if (I915_SELFTEST_ONLY(engine->breadcrumbs.mock))
                return true;

        /* Interrupt/tasklet pending? */
        if (test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted))
                return false;

        /* Both ports drained, no more ELSP submission? */
        if (port_request(&engine->execlist_port[0]))
                return false;

        /* ELSP is empty, but there are ready requests? */
        if (READ_ONCE(engine->execlist_first))
                return false;

        /* Ring stopped? */
        if (!ring_is_idle(engine))
                return false;

        return true;
}


It might be possible for an interrupt to kick in and dirty irq_posted, a very late active->idle notification. Or the ring_is_idle() check on RING_MODE may be garbage.

I'm going to go back and play the waiting game. Note for future self, consider adding a WARN_ON(test_bit(ENGINE_IRQ_EXECLIST, &engine->irq_posted));

Comment 2 Chris Wilson 2017-08-29 17:19:00 UTC

To allow the machine to recover after we encounter this mystery:

commit fc692bd31bc9dd17c7cc59abdb514a58964fc2a7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Aug 26 12:09:35 2017 +0100

    drm/i915: Discard the request queue if we fail to sleep before suspend
    
    If we fail to clear the outstanding request queue before suspending,
    mark those requests as lost.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.