103260 – [CI] igt@gem_eio@in-flight-suspend - dmesg-warn - WARNING: CPU: 4 PID: 1503 at drivers/gpu/drm/i915/intel_ringbuffer.c | incomplete

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2017-10-13 12:47 UTC by Marta Löfstedt
Modified:	2018-03-02 15:52 UTC
CC List:	1 user

See Also:
i915 platform:	HSW, SNB
i915 features:	GEM/Other

Description Marta Löfstedt 2017-10-13 12:47:17 UTC

On CI_DRM_3228 shard-hsw6, the new test igt@gem_eio@in-flight-suspend warns on:

[   54.749897] WARN_ON((dev_priv->uncore.funcs.mmio_readl(dev_priv, (((const i915_reg_t){ .reg = (((engine)->mmio_base)+0x9c) })), true) & (1 << 9)) == 0)
[   54.749918] ------------[ cut here ]------------
[   54.749968] WARNING: CPU: 4 PID: 1503 at drivers/gpu/drm/i915/intel_ringbuffer.c:448 init_ring_common+0x606/0x610 [i915]
[   54.749971] Modules linked in: vgem snd_hda_codec_hdmi i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul snd_hda_codec_realtek crc32_pclmul snd_hda_codec_generic snd_hda_intel ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core mei_me snd_pcm r8169 mei mii prime_numbers lpc_ich
[   54.750043] CPU: 4 PID: 1503 Comm: kworker/u16:43 Tainted: G     U          4.14.0-rc4-CI-CI_DRM_3228+ #1
[   54.750046] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
[   54.750051] Workqueue: events_unbound async_run_entry_fn
[   54.750056] task: ffff880405770040 task.stack: ffffc9000099c000
[   54.750077] RIP: 0010:init_ring_common+0x606/0x610 [i915]
[   54.750080] RSP: 0018:ffffc9000099fc20 EFLAGS: 00010282
[   54.750085] RAX: 000000000000008b RBX: ffff8803fa3d0000 RCX: 0000000000000006
[   54.750088] RDX: 0000000000001306 RSI: ffffffff81d0e3d4 RDI: ffffffff81cc20ee
[   54.750091] RBP: ffffc9000099fc60 R08: 0000000000000000 R09: 0000000000000001
[   54.750093] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8804060d42a8
[   54.750096] R13: ffff88040374f458 R14: ffff8803fa3d0000 R15: 00000000000020c0
[   54.750099] FS:  0000000000000000(0000) GS:ffff88041fb00000(0000) knlGS:0000000000000000
[   54.750102] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   54.750105] CR2: 0000556f64ee5248 CR3: 0000000403b74004 CR4: 00000000001606e0
[   54.750107] Call Trace:
[   54.750131]  init_render_ring+0x17/0x170 [i915]
[   54.750151]  i915_gem_init_hw+0xf2/0x2a0 [i915]
[   54.750168]  i915_pm_restore+0x95/0x190 [i915]
[   54.750186]  i915_pm_resume+0xe/0x10 [i915]
[   54.750190]  pci_pm_resume+0x74/0xb0
[   54.750195]  dpm_run_callback+0x6f/0x310
[   54.750198]  ? pci_pm_suspend+0x140/0x140
[   54.750203]  device_resume+0xb4/0x1e0
[   54.750207]  ? dpm_watchdog_set+0x70/0x70
[   54.750213]  async_resume+0x1d/0x50
[   54.750217]  async_run_entry_fn+0x38/0x160
[   54.750222]  process_one_work+0x233/0x660
[   54.750228]  worker_thread+0x4e/0x3b0
[   54.750235]  kthread+0x152/0x190
[   54.750238]  ? process_one_work+0x660/0x660
[   54.750241]  ? kthread_create_on_node+0x40/0x40
[   54.750246]  ret_from_fork+0x27/0x40
[   54.750254] Code: 0f 0b 41 bf 80 41 00 00 e9 82 fe ff ff 41 bf 80 42 00 00 e9 77 fe ff ff 48 c7 c6 40 ae 2c a0 48 c7 c7 a1 3c 2b a0 e8 3b db ef e0 <0f> ff e9 cc fe ff ff 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 41 54 
[   54.750415] ---[ end trace c04d0ec270026d4a ]---
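For readers decoding the WARN_ON condition in the log: it reads the engine register at mmio_base + 0x9c and fires when bit 9 is clear. A minimal sketch of that check follows. The macro names and the 0x2000 render-ring base are assumptions made for illustration, not values taken from this report; only the 0x9c offset and bit 9 come from the log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: mmio_base of the render ring (rcs0); not stated in the log. */
#define RENDER_RING_BASE   0x2000u
/* Offset 0x9c is taken from the WARN_ON expression in the log. */
#define RING_MI_MODE(base) ((base) + 0x9cu)
/* Bit 9 is the bit the WARN_ON tests; the name is illustrative. */
#define MODE_IDLE          (1u << 9)

/* Returns true when the warning would fire: the idle bit is not set. */
static bool ring_not_idle(uint32_t mi_mode_value)
{
    return (mi_mode_value & MODE_IDLE) == 0;
}
```

Under these assumptions the register read for the render ring would be at offset 0x209c, and the warning means the ring had not reported idle at the point the driver tried to re-initialize it.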

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3228/shard-hsw6/igt@gem_eio@in-flight-suspend.html

Comment 1 Marta Löfstedt 2017-10-13 12:48:53 UTC

The test also ends as incomplete on the SNB shards:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3228/shard-snb2/igt@gem_eio@in-flight-suspend.html

Comment 2 Chris Wilson 2017-10-13 13:01:59 UTC

<7>[   41.696323] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x61/0x80 [i915], irq posted? no, current seqno=cf3a, last=cf7c
<7>[   49.696247] [drm:i915_reset_device [i915]] resetting chip
<5>[   49.696598] i915 0000:00:02.0: Resetting chip after gpu hang
<7>[   49.697285] [drm:i915_reset [i915]] GPU reset disabled
...
<4>[   54.749897] WARN_ON((dev_priv->uncore.funcs.mmio_readl(dev_priv, (((const i915_reg_t){ .reg = (((engine)->mmio_base)+0x9c) })), true) & (1 << 9)) == 0)
<4>[   54.749918] ------------[ cut here ]------------
<4>[   54.749968] WARNING: CPU: 4 PID: 1503 at drivers/gpu/drm/i915/intel_ringbuffer.c:448 init_ring_common+0x606/0x610 [i915]

So at a basic level it is a side-effect of the test. As we disable the GPU reset to cause the EIO, the ring is not idle when we try to restart it.

It looks like we can (a) always do stop-rings upon reset regardless of the availability of the GPU reset, and (b) extend the stop-ring coverage in init_ring_common() to not clear the STOP bit until after we are ready to restart.
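Point (a) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the driver code: struct fake_ring, fake_stop_ring() and fake_reset() are hypothetical names, the STOP bit position is assumed, and the model simply treats a stopped ring as one that drains and reports idle.

```c
#include <stdbool.h>
#include <stdint.h>

#define STOP_RING (1u << 8)   /* assumption: position of the stop bit */
#define MODE_IDLE (1u << 9)   /* bit tested by the WARN_ON in the log */

/* Hypothetical stand-in for one engine's state. */
struct fake_ring {
    uint32_t mi_mode;   /* modelled mode register */
    bool busy;          /* true while requests are still in flight */
};

/* Model: setting STOP lets the ring drain, after which it reports idle. */
static void fake_stop_ring(struct fake_ring *ring)
{
    ring->mi_mode |= STOP_RING;
    ring->busy = false;
    ring->mi_mode |= MODE_IDLE;
}

/*
 * Reset entry point of the model. The ring is stopped first, whether or
 * not a full GPU reset is permitted. Returns false when the reset itself
 * is disabled, which is the case the test uses to provoke EIO.
 */
static bool fake_reset(struct fake_ring *ring, bool reset_enabled)
{
    fake_stop_ring(ring);          /* (a): always stop the rings */
    if (!reset_enabled)
        return false;              /* no reset performed, ring left stopped */
    ring->mi_mode = MODE_IDLE;     /* model of a successful full reset */
    return true;
}
```

In this model, a later resume finds the idle bit set even though the reset was disabled, which is the state the WARN_ON expects.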

Comment 3 Chris Wilson 2017-10-13 13:03:15 UTC

(In reply to Marta Löfstedt from comment #1)
> Also, incomplete on shards-SNB
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3228/shard-snb2/igt@gem_eio@in-flight-suspend.html

Unlikely to be this. My theory for those is https://patchwork.freedesktop.org/series/31848/

Comment 4 Marta Löfstedt 2017-10-13 13:20:59 UTC

Note that, so far at least, the test gives stable results. Both the incomplete and the dmesg-warn also appear on CI_DRM_3227:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3227/shard-snb5/igt@gem_eio@in-flight-suspend.html

and

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3227/shard-hsw4/igt@gem_eio@in-flight-suspend.html

Comment 5 Chris Wilson 2017-10-14 08:56:40 UTC

commit 5896a5c8c9c01b09af05b02cdb2ae275ef143959
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 13 14:12:18 2017 +0100

    drm/i915: Always stop the rings before a missing GPU reset
    
    Always try to stop the rings, even if the GPU reset itself has been
    disabled (via modparam i915.reset). This should at least stop the hw
    from spinning in the background consuming resources (e.g. power and
    memory bandwidth) letting the system rest-in-peace.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=103260
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171013131218.18013-2-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

commit 7836cd02f27c03af2fca04b450177c51fc7caf1e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 13 14:12:17 2017 +0100

    drm/i915: Keep the rings stopped until they have been re-initialized
    
    Before modifying the ring register (RING_START, HEAD, TAIL, CTL) we
    first make sure it is stopped (or else the hw may not resample the
    registers). However, we do not need to let the hw restart until after we
    have reprogrammed all the rings. This should help prevent situations
    where pending operations on the ring may resume (because we are trying
    to re-initialize following an unsuccessful GPU hang, i.e. from
    i915_gem_unset_wedged).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103260
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171013131218.18013-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
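The ordering described in the second commit message (stop first, reprogram START, HEAD, TAIL and CTL, and only then allow a restart) can be sketched with a user-space model. Everything here is hypothetical: struct fake_ring_regs, fake_write() and fake_init_all() are illustrative names, the STOP bit position is assumed, and the actual patch may structure the sequence differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define STOP_RING (1u << 8)   /* assumption: position of the stop bit */

/* Hypothetical model of one ring's registers. */
struct fake_ring_regs {
    uint32_t start, head, tail, ctl, mi_mode;
    bool written_while_running;   /* set if a register changed without STOP */
};

/* Write a ring register and record whether the ring was running. */
static void fake_write(struct fake_ring_regs *r, uint32_t *reg, uint32_t val)
{
    if (!(r->mi_mode & STOP_RING))
        r->written_while_running = true;
    *reg = val;
}

/* Stop every ring, reprogram every ring, and only then clear STOP. */
static void fake_init_all(struct fake_ring_regs *rings, int count,
                          uint32_t start, uint32_t ctl)
{
    int i;

    for (i = 0; i < count; i++)
        rings[i].mi_mode |= STOP_RING;

    for (i = 0; i < count; i++) {
        fake_write(&rings[i], &rings[i].head, 0);
        fake_write(&rings[i], &rings[i].tail, 0);
        fake_write(&rings[i], &rings[i].start, start);
        fake_write(&rings[i], &rings[i].ctl, ctl);
    }

    /* Restart only after all rings have been reprogrammed. */
    for (i = 0; i < count; i++)
        rings[i].mi_mode &= ~STOP_RING;
}
```

The written_while_running flag exists only in the model, to make it easy to check that no register was touched while a ring could still execute pending work.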

Comment 6 Marta Löfstedt 2017-10-16 07:20:26 UTC

The above fixes were integrated into CI_DRM_3237, and the dmesg-warn on the HSW shards appears to be gone. However, the SNB incompletes are still present.

Comment 7 Elizabeth 2018-03-02 15:52:29 UTC

According to CI results, this specific warning has not been seen in a while, and neither have the SNB incompletes with this test. Closing, thank you.
