Bug 102456

Summary: [BAT][GLK] igt@gem_exec_suspend@basic-s3 hits WARN_ON(wait_for_engine(engine, 50))
Product: DRI
Reporter: Martin Peres <martin.peres>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED WORKSFORME
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: high
CC: intel-gfx-bugs, ricardo.vega
Version: XOrg git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: GLK
i915 features: GEM/Other

Description Martin Peres 2017-08-28 15:40:35 UTC
On CI_DRM_3011, the machine fi-glk-2a hits the following issue when running igt@gem_exec_suspend@basic-s3:

[  320.240058] WARN_ON(wait_for_engine(engine, 50))
[  320.240117] ------------[ cut here ]------------
[  320.240163] WARNING: CPU: 0 PID: 3144 at drivers/gpu/drm/i915/i915_gem.c:3385 i915_gem_wait_for_idle+0x19d/0x200 [i915]
[  320.240166] Modules linked in: snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm r8169 mii prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel
[  320.240225] CPU: 0 PID: 3144 Comm: kworker/u8:15 Tainted: G     U          4.13.0-rc6-CI-CI_DRM_3011+ #1
[  320.240229] Hardware name: Intel Corp. Geminilake/GLK RVP2 LP4SD (07), BIOS GELKRVPA.X64.0045.B51.1704281422 04/28/2017
[  320.240236] Workqueue: events_unbound async_run_entry_fn
[  320.240241] task: ffff880174ff2780 task.stack: ffffc90000780000
[  320.240282] RIP: 0010:i915_gem_wait_for_idle+0x19d/0x200 [i915]
[  320.240285] RSP: 0018:ffffc90000783c40 EFLAGS: 00010286
[  320.240290] RAX: 0000000000000024 RBX: fffffffffffffffe RCX: 0000000000000006
[  320.240293] RDX: 0000000000000006 RSI: ffffffff81cf74e4 RDI: ffffffff81cae38e
[  320.240296] RBP: ffffc90000783c70 R08: 0000000000000000 R09: 0000000000000001
[  320.240299] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175e90008
[  320.240301] R13: ffff880168ad0000 R14: 0000000100004f0e R15: ffff880168ad4350
[  320.240305] FS:  0000000000000000(0000) GS:ffff88017fc00000(0000) knlGS:0000000000000000
[  320.240308] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  320.240310] CR2: 00007f5324877218 CR3: 0000000003e0f000 CR4: 00000000003406f0
[  320.240313] Call Trace:
[  320.240358]  i915_gem_suspend+0x45/0x140 [i915]
[  320.240395]  i915_pm_suspend+0x86/0x1a0 [i915]
[  320.240403]  pci_pm_suspend+0x78/0x140
[  320.240411]  dpm_run_callback+0x6f/0x310
[  320.240415]  ? pci_pm_resume+0xa0/0xa0
[  320.240421]  __device_suspend+0x102/0x380
[  320.240427]  ? dpm_watchdog_set+0x70/0x70
[  320.240435]  async_suspend+0x1f/0xa0
[  320.240440]  async_run_entry_fn+0x38/0x160
[  320.240446]  process_one_work+0x224/0x650
[  320.240454]  worker_thread+0x4e/0x3b0
[  320.240462]  kthread+0x114/0x150
[  320.240465]  ? process_one_work+0x650/0x650
[  320.240469]  ? kthread_create_on_node+0x40/0x40
[  320.240475]  ret_from_fork+0x27/0x40
[  320.240486] Code: d0 0f 85 2e ff ff ff 48 83 c4 08 31 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c6 f0 9d 22 a0 48 c7 c7 68 7d 21 a0 e8 e4 f3 fb e0 <0f> ff 31 d2 4c 89 ee 48 c7 c7 90 da 12 a0 e8 50 92 00 e1 48 83 
[  320.240641] ---[ end trace 4fed512b7c104387 ]---

This issue then prevented the machine from suspending to RAM.

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3011/fi-glk-2a/igt@gem_exec_suspend@basic-s3.html
Comment 1 Chris Wilson 2017-08-28 15:47:07 UTC
See https://patchwork.freedesktop.org/series/29387/
Comment 2 Chris Wilson 2017-08-29 17:16:48 UTC
This should fix the suspend failure

commit cad9946c2a4375386062131858881cfd30fc1b8f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Aug 26 12:09:33 2017 +0100

    drm/i915: Always sanity check engine state upon idling
    
    When we do a locked idle we know that afterwards all requests have been
    completed and the engines have been cleared of tasks. For whatever
    reason, this doesn't always happen and we may go into a suspend with
    ELSP still full, and this causes an issue upon resume as we get very,
    very confused.
    
    If the engines refuse to idle, mark the device as wedged. In the process
    we get rid of the maybe unused open-coded version of wait_for_engines
    reported by Nick Desaulniers and Matthias Kaehlcke.
    
    v2: Suppress the -EIO before suspend, but keep it for seqno wrap.

but leaves the underlying issue unresolved. FAIL -> WARN.
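
For readers unfamiliar with the pattern the commit message describes (wait for every engine to idle, and mark the device as wedged if one refuses), here is a minimal user-space sketch. It is not the i915 code: the `sketch_*` types and functions, the poll-count timeout, and the `drain_per_poll` field are all hypothetical stand-ins chosen for illustration. Only the idea of "idle or wedge" and the -EIO return come from the commit message above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an engine: not the real i915 structure. */
struct sketch_engine {
    int pending;        /* work still queued on the engine */
    int drain_per_poll; /* items retired per poll; 0 models a stuck engine */
};

/* Hypothetical stand-in for the device state. */
struct sketch_device {
    struct sketch_engine *engines;
    size_t num_engines;
    bool wedged;        /* set once an engine refuses to idle */
};

/* Poll one engine at most max_polls times; true if it became idle. */
bool sketch_wait_for_engine(struct sketch_engine *e, int max_polls)
{
    for (int i = 0; i < max_polls; i++) {
        if (e->pending == 0)
            return true;
        e->pending -= e->drain_per_poll;
        if (e->pending < 0)
            e->pending = 0;
    }
    return e->pending == 0;
}

/*
 * Wait for all engines to idle. If any engine fails to idle within
 * the poll budget, mark the device as wedged and report -EIO.
 */
int sketch_wait_for_idle(struct sketch_device *dev, int max_polls)
{
    assert(dev != NULL);

    for (size_t i = 0; i < dev->num_engines; i++) {
        if (!sketch_wait_for_engine(&dev->engines[i], max_polls)) {
            dev->wedged = true;
            return -EIO;
        }
    }
    return 0;
}
```

In this sketch a caller on the suspend path could choose to ignore the -EIO and continue suspending with the device marked wedged, which is one reading of the "v2" note in the commit message; how the real driver handles it is not shown here.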
Comment 3 Jari Tahvanainen 2017-09-13 08:42:15 UTC
Moving priority to high, since the issue is sporadic.
Comment 4 Marta Löfstedt 2017-10-16 11:21:06 UTC
This issue was filed against a machine that is no longer in BAT. The issue has never been reproduced on the current GLK machine in BAT.
