Bug 105249 - [CI] igt@gem_ctx_isolation@* - incomplete
Summary: [CI] igt@gem_ctx_isolation@* - incomplete
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Marta Löfstedt
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-02-26 07:30 UTC by Marta Löfstedt
Modified: 2018-03-26 06:24 UTC

See Also:
i915 platform: BDW, BXT, GLK, KBL, SKL
i915 features: GEM/Other


Comment 1 Chris Wilson 2018-02-26 09:46:45 UTC
We hit an assert, can't see which and the trace looks like correct behaviour afaict.
Comment 2 Marta Löfstedt 2018-02-26 13:03:56 UTC
(In reply to Chris Wilson from comment #1)
> We hit an assert, can't see which and the trace looks like correct behaviour
> afaict.

Here is a new one:

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4299/shard-kbl2/igt@gem_ctx_isolation@bcs0-reset.html
Comment 4 Marta Löfstedt 2018-02-28 07:15:37 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4306/shard-kbl3/igt@gem_ctx_isolation@vcs1-reset.html

<0>[  412.300836] i915/sig-562     3..s2 412290976us : execlists_submission_tasklet: vcs1 out[0]: ctx=13.1, seqno=e, prio=0
<0>[  412.300844] ---------------------------------
<4>[  412.300848] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 snd_hda_intel x86_pkg_temp_thermal intel_powerclamp snd_hda_codec coretemp crct10dif_pclmul crc32_pclmul snd_hwdep snd_hda_core ghash_clmulni_intel e1000e snd_pcm mei_me mei prime_numbers
<4>[  412.300880] CPU: 3 PID: 562 Comm: i915/signal:3 Tainted: G     U           4.16.0-rc2-CI-CI_DRM_3838+ #1
<4>[  412.300887] Hardware name:  /NUC7i5BNB, BIOS BNKBL357.86A.0054.2017.1025.1822 10/25/2017
<4>[  412.300906] RIP: 0010:execlists_submission_tasklet+0x5ee/0xeb0 [i915]
<4>[  412.300912] RSP: 0018:ffff88027ed83ea8 EFLAGS: 00010296
<4>[  412.300917] RAX: 0000000000000027 RBX: 0000000000000004 RCX: 0000000000000103
<4>[  412.300923] RDX: 0000000080000103 RSI: ffffffff8211c277 RDI: 00000000ffffffff
<4>[  412.300928] RBP: ffff88027ed83f20 R08: 0000000000000001 R09: 0000000000000000
<4>[  412.300933] R10: ffff8802713c5ec0 R11: 0000000000000000 R12: ffff880269060008
<4>[  412.300939] R13: ffff880260fda040 R14: ffff880269060010 R15: ffff880269060008
<4>[  412.300944] FS:  0000000000000000(0000) GS:ffff88027ed80000(0000) knlGS:0000000000000000
<4>[  412.300951] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  412.300955] CR2: 00007f941f40d9e0 CR3: 0000000005210002 CR4: 00000000003606e0
<4>[  412.300961] Call Trace:
<4>[  412.300965]  <IRQ>
<4>[  412.300970]  tasklet_hi_action+0x89/0x110
<4>[  412.300976]  __do_softirq+0xc1/0x4aa
<4>[  412.300982]  irq_exit+0xa4/0xb0
<4>[  412.300985]  do_IRQ+0x67/0x120
<4>[  412.300990]  common_interrupt+0x84/0x84
<4>[  412.300994]  </IRQ>
<4>[  412.300997] RIP: 0010:_raw_spin_unlock_irq+0x2a/0x50
<4>[  412.301001] RSP: 0018:ffffc90000543db0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffffdd
<4>[  412.301008] RAX: ffff88026841a840 RBX: ffff88027eda1740 RCX: 0000000000000001
<4>[  412.301013] RDX: 0000000000000000 RSI: ffffffff8210fc21 RDI: 0000000000000001
<4>[  412.301019] RBP: ffffc90000543e00 R08: 0000000000000001 R09: 0000000000000001
<4>[  412.301024] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88027537d040
<4>[  412.301030] R13: ffff88027276f740 R14: ffff88026841a840 R15: 0000000000000001
<4>[  412.301038]  finish_task_switch+0x98/0x240
<4>[  412.301043]  ? finish_task_switch+0x6a/0x240
<4>[  412.301047]  ? __clear_rsb+0x15/0x3d
<4>[  412.301051]  ? __switch_to_asm+0x1d/0x30
<4>[  412.301056]  __schedule+0x3cf/0xb00
<4>[  412.301061]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
<4>[  412.301066]  ? __kthread_parkme+0x39/0x90
<4>[  412.301070]  schedule+0x37/0x90
<4>[  412.301074]  __kthread_parkme+0x3e/0x90
<4>[  412.301093]  ? intel_breadcrumbs_signaler+0x59/0x4c0 [i915]
<4>[  412.301112]  ? intel_breadcrumbs_signaler+0x59/0x4c0 [i915]
<4>[  412.301131]  intel_breadcrumbs_signaler+0x4af/0x4c0 [i915]
<4>[  412.301138]  kthread+0xfb/0x130
<4>[  412.301155]  ? __intel_engine_remove_signal+0xb0/0xb0 [i915]
<4>[  412.301160]  ? _kthread_create_on_node+0x30/0x30
<4>[  412.301166]  ret_from_fork+0x3a/0x50
<4>[  412.301171] Code: 7d c8 89 c7 c1 ef 08 83 e7 07 89 fb 41 89 bf c4 03 00 00 e9 e5 fa ff ff 48 c7 c6 59 d6 29 a0 48 c7 c7 b1 d4 29 a0 e8 97 b1 f1 e0 <0f> 0b 48 89 75 a8 4c 89 55 b0 e8 33 73 f3 e0 49 2b 84 24 08 15 
<1>[  412.301231] RIP: execlists_submission_tasklet+0x5ee/0xeb0 [i915] RSP: ffff88027ed83ea8
<4>[  412.301250] ---[ end trace 5f6a45705eaa4f7f ]---
<0>[  413.999230] Kernel panic - not syncing: Fatal exception in interrupt
<0>[  413.999245] Dumping ftrace buffer:
<0>[  413.999250]    (ftrace buffer empty)
<0>[  413.999254] Kernel Offset: disabled
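
When triaging traces like the one above, it can help to pull the faulting frame out of the log mechanically. The following is a minimal sketch, assuming lines of the form `<level>[ timestamp] message` and frames of the form `symbol+0xOFFSET/0xSIZE [module]` as they appear in this dump; the helper names `parse_line` and `faulting_frame` are hypothetical and not part of any CI tooling.

```python
import re

# "<4>[  412.300906] RIP: ..." -> log level, timestamp in seconds, message
LINE_RE = re.compile(r"^<(?P<level>\d)>\[\s*(?P<ts>\d+\.\d+)\] (?P<msg>.*)$")
# "execlists_submission_tasklet+0x5ee/0xeb0 [i915]" -> symbol, offset, size, module
FRAME_RE = re.compile(
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?: \[(?P<mod>\w+)\])?"
)


def parse_line(line):
    """Split one dmesg line into (level, timestamp, message), or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return int(m["level"]), float(m["ts"]), m["msg"]


def faulting_frame(lines):
    """Return (symbol, offset, size, module) from the first RIP line found."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        msg = parsed[2]
        if msg.startswith("RIP:"):
            frame = FRAME_RE.search(msg)
            if frame:
                return (frame["sym"], int(frame["off"], 16),
                        int(frame["size"], 16), frame["mod"])
    return None


log = [
    "<4>[  412.300887] Hardware name:  /NUC7i5BNB, BIOS BNKBL357.86A.0054.2017.1025.1822 10/25/2017",
    "<4>[  412.300906] RIP: 0010:execlists_submission_tasklet+0x5ee/0xeb0 [i915]",
    "<4>[  412.300997] RIP: 0010:_raw_spin_unlock_irq+0x2a/0x50",
]
print(faulting_frame(log))
```

The sketch returns the first RIP line only, which in this trace is the frame that hit the assert inside the IRQ context; the second RIP line belongs to the interrupted task.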
Comment 6 Marta Löfstedt 2018-03-20 08:54:59 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_4/fi-skl-guc/igt@gem_ctx_isolation@rcs0-reset.html

No pstore "header"; however, here is the first call trace:

<4>[   43.775799] Call Trace:
<4>[   43.775806]  <IRQ>
<4>[   43.775831]  guc_submission_tasklet+0x37b/0x940 [i915]
<4>[   43.775837]  tasklet_hi_action+0x8e/0x110
<4>[   43.775842]  __do_softirq+0xc1/0x4aa
<4>[   43.775846]  irq_exit+0xa4/0xb0
<4>[   43.775849]  do_IRQ+0x67/0x120
<4>[   43.775854]  common_interrupt+0xf/0xf
Comment 8 Marta Löfstedt 2018-03-21 08:07:33 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_3/fi-bdw-5557u/igt@gem_ctx_isolation@bcs0-s3.html

run.log:
pass: igt/gem_ctx_isolation/bcs0-s3

[15/97] skip: 8, pass: 7 -
FATAL: command execution failed
...
Completed CI_IGT_test drmtip_3/fi-bdw-5557u/34 : FAILURE
CI_IGT_test runtime 240 seconds
Rebooting fi-bdw-5557u

last dmesg:
<4>[   40.493107] Setting dangerous option reset - tainting kernel
<7>[   40.497149] [IGT] gem_ctx_isolation: starting subtest bcs0-S3
<6>[   40.567914] PM: suspend entry (deep)
Comment 9 Marta Löfstedt 2018-03-21 08:10:06 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3903/fi-cnl-y3/igt@gem_ctx_isolation@vecs0-reset.html

run.log:
running: igt/gem_ctx_isolation/vecs0-reset

[65/98] skip: 29, pass: 34, fail: 2 /     
FATAL: command execution failed
...
Completed CI_IGT_test CI_DRM_3903/fi-cnl-y3/23 : FAILURE
CI_IGT_test runtime 843 seconds
Rebooting fi-cnl-y3

Last dmesg:
<7>[  529.085780] [drm:verify_single_dpll_state.isra.79 [i915]] DPLL 1
<6>[  529.119263] Console: switching to colour frame buffer device 480x135
<6>[  529.269811] Console: switching to colour dummy device 80x25
<7>[  529.269862] [IGT] gem_ctx_isolation: executing
Followed by stray
Comment 10 Chris Wilson 2018-03-22 23:40:38 UTC
This should be fixed by
commit 0f36a85c3bd5e0dfcbb49af203a96a933dae86cf
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 22 07:35:33 2018 +0000

    drm/i915: Flush pending interrupt following a GPU reset
Comment 11 Marta Löfstedt 2018-03-23 07:30:40 UTC
The patch was integrated in CI_DRM_3969. I will monitor the results and hopefully close the bug. This will take some time, since BAT machines are affected from the shardlist on BAT runs.
Comment 12 Chris Wilson 2018-03-23 09:48:27 UTC
(In reply to Chris Wilson from comment #10)
> This should be fixed by
> commit 0f36a85c3bd5e0dfcbb49af203a96a933dae86cf
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Mar 22 07:35:33 2018 +0000
> 
>     drm/i915: Flush pending interrupt following a GPU reset

Ah, that was only the set-wedge path. Reset path: https://patchwork.freedesktop.org/series/40550/
Comment 13 Marta Löfstedt 2018-03-23 12:23:28 UTC
(In reply to Chris Wilson from comment #12)
> (In reply to Chris Wilson from comment #10)
> > This should be fixed by
> > commit 0f36a85c3bd5e0dfcbb49af203a96a933dae86cf
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date:   Thu Mar 22 07:35:33 2018 +0000
> > 
> >     drm/i915: Flush pending interrupt following a GPU reset
> 
> Ah, that was only the set-wedge path. Reset path:
> https://patchwork.freedesktop.org/series/40550/

Is there more coming, or should the bug be set to fixed?
Comment 14 Chris Wilson 2018-03-23 12:31:20 UTC
I'll mark the bug as fixed when that patch lands, hopefully today, so we can start to get results over the weekend.
Comment 15 Chris Wilson 2018-03-23 17:06:57 UTC
commit 46b3617dfec875c1414c6ccbfcab371c97735562
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 23 10:18:24 2018 +0000

    drm/i915: Actually flush interrupts on reset not just wedging
    
    Commit 0f36a85c3bd5 ("drm/i915: Flush pending interrupt following a GPU
    reset") got confused and only applied the flush to the set-wedge path
    (which itself is proving troublesome), but we also need the
    serialisation on the regular reset path. Oops.
    
    Move the interrupt into reset_irq() and make it common to the reset and
    final set-wedge.
    
    v2: reset_irq() after port cancellation, as we assert that
    execlists->active is sane for cancellation (and is being reset by
    reset_irq).
    
    References: 0f36a85c3bd5 ("drm/i915: Flush pending interrupt following a GPU reset")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Jeff McGee <jeff.mcgee@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180323101824.14645-1-chris@chris-wilson.co.uk
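
The ordering described in the commit message above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the i915 code: `ToyEngine`, its fields, and the method bodies are hypothetical stand-ins. It shows the two points the message makes: the interrupt flush lives in one `reset_irq()` shared by the reset and set-wedge paths, and `reset_irq()` runs after port cancellation because cancellation asserts that the active state is still sane.

```python
class ToyEngine:
    """Toy stand-in for an engine; NOT the real i915 data structures."""

    def __init__(self):
        self.active = True        # hypothetical model of execlists->active
        self.pending_irq = True   # an interrupt raised before the reset
        self.ports = ["req0", "req1"]

    def cancel_ports(self):
        # Cancellation asserts that the active state is still sane,
        # so it must run before reset_irq() clears that state.
        assert self.active, "ports cancelled after active state was reset"
        self.ports.clear()

    def reset_irq(self):
        # Common flush: drop the stale interrupt and reset the active state.
        self.pending_irq = False
        self.active = False

    def reset(self):
        # Regular reset path (the one the first fix missed).
        self.cancel_ports()
        self.reset_irq()

    def set_wedged(self):
        # Final set-wedge path shares the same flush.
        self.cancel_ports()
        self.reset_irq()


engine = ToyEngine()
engine.reset()
print(engine.pending_irq, engine.ports)
```

Calling `reset_irq()` before `cancel_ports()` in this model trips the assertion, which mirrors the reason given for the v2 ordering.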

