Bug 111660

Summary: [CI][RESUME] igt@gem_tiled_fence_blits@normal - incomplete - GEM_BUG_ON(!node_signaled(dep->signaler))
Product: DRI
Component: DRM/Intel
Version: XOrg git
Hardware: Other
OS: All
Status: RESOLVED WORKSFORME
Severity: not set
Priority: high
Reporter: Martin Peres <martin.peres>
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: intel-gfx-bugs
Whiteboard:
i915 platform: TGL
i915 features: GEM/Other

Description Martin Peres 2019-09-11 11:44:27 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_364/fi-tgl-u/igt@gem_tiled_fence_blits@normal.html

<3> [116.079585] i915_sched_node_fini:452 GEM_BUG_ON(!node_signaled(dep->signaler))
<4> [116.079663] ------------[ cut here ]------------
<2> [116.079665] kernel BUG at drivers/gpu/drm/i915/i915_scheduler.c:452!
<4> [116.079682] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
<4> [116.079687] CPU: 2 PID: 1023 Comm: gem_tiled_fence Tainted: G     U            5.3.0-rc7-gedc92c969391-drmtip_364+ #1
<4> [116.079693] Hardware name: Intel Corporation Tiger Lake Client Platform/TigerLake U DDR4 SODIMM RVP, BIOS TGLSFWI1.R00.2321.A01.1908052106 08/05/2019
<4> [116.079757] RIP: 0010:i915_sched_node_fini+0x193/0x3d0 [i915]
<4> [116.079762] Code: 8c 98 bc df 48 8b 35 c4 7a 1d 00 49 c7 c0 d9 83 6a c0 b9 c4 01 00 00 48 c7 c2 10 cd 64 c0 48 c7 c7 0e 80 56 c0 e8 1d 7c c3 df <0f> 0b 48 8b 3d a4 3d 1c 00 4c 89 f6 e8 4c 91 cd df e9 ee fe ff ff
<4> [116.079772] RSP: 0018:ffffb2c3006bf9d8 EFLAGS: 00010086
<4> [116.079776] RAX: 000000000000000f RBX: ffff8cea7111c9e0 RCX: 0000000000000000
<4> [116.079781] RDX: 0000000000000001 RSI: 0000000000000008 RDI: 0000000000000880
<4> [116.079785] RBP: ffff8cea7111c9f0 R08: 0000000000000000 R09: 0000000000000880
<4> [116.079790] R10: 000000006e5c0e39 R11: ffff8cec1e0c7a38 R12: dead000000000122
<4> [116.079794] R13: dead000000000100 R14: ffff8cea38554640 R15: 0000000000000000
<4> [116.079799] FS:  00007f2456c1f300(0000) GS:ffff8cec20500000(0000) knlGS:0000000000000000
<4> [116.079804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [116.079808] CR2: 00007f45c328ec00 CR3: 000000048fb8c003 CR4: 0000000000760ee0
<4> [116.079813] PKRU: 55555554
<4> [116.079815] Call Trace:
<4> [116.079849]  i915_request_retire+0x3b8/0x8e0 [i915]
<4> [116.079880]  i915_request_create+0x4b/0x1b0 [i915]
<4> [116.079908]  i915_gem_do_execbuffer+0xa79/0x22f0 [i915]
<4> [116.079915]  ? __thaw_task+0x40/0x40
<4> [116.079920]  ? stack_trace_save+0x46/0x70
<4> [116.079926]  ? init_object+0x66/0x80
<4> [116.079931]  ? __lock_acquire+0x4ac/0x1e90
<4> [116.079938]  ? __might_fault+0x39/0x90
<4> [116.079965]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4> [116.079991]  i915_gem_execbuffer2_ioctl+0x11b/0x460 [i915]
<4> [116.080017]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4> [116.080023]  drm_ioctl_kernel+0x83/0xf0
<4> [116.080028]  drm_ioctl+0x2f3/0x3b0
<4> [116.080054]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4> [116.080061]  do_vfs_ioctl+0xa0/0x6f0
<4> [116.080066]  ksys_ioctl+0x35/0x60
<4> [116.080070]  __x64_sys_ioctl+0x11/0x20
<4> [116.080075]  do_syscall_64+0x55/0x1c0
<4> [116.080079]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [116.080084] RIP: 0033:0x7f24560b25d7
<4> [116.080087] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48
<4> [116.080099] RSP: 002b:00007ffdad3c7928 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4> [116.080104] RAX: ffffffffffffffda RBX: 00000000000003dc RCX: 00007f24560b25d7
<4> [116.080109] RDX: 00007ffdad3c79d0 RSI: 0000000040406469 RDI: 0000000000000005
<4> [116.080115] RBP: 00007ffdad3c79d0 R08: 00007f2456387200 R09: 00007f2456387240
<4> [116.080120] R10: 0000000000000056 R11: 0000000000000246 R12: 0000000040406469
<4> [116.080124] R13: 0000000000000005 R14: 0000558f51821cf0 R15: 0000558f51827cec
<4> [116.080131] Modules linked in: vgem mei_hdcp i915 ax88179_178a usbnet mii x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel mei_me mei prime_numbers
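
For reference, the assertion fires while retiring a request: when its scheduling node is torn down, every node it depended on must already have signaled. A simplified sketch of the check, modeled loosely on drivers/gpu/drm/i915/i915_scheduler.c of that era (illustrative, not the verbatim kernel source):

    /* Simplified sketch, not the verbatim kernel source. */
    static inline bool node_signaled(const struct i915_sched_node *node)
    {
            /* Completion is derived from the request owning the node. */
            return i915_request_completed(node_to_request(node));
    }

    void i915_sched_node_fini(struct i915_sched_node *node)
    {
            struct i915_dependency *dep, *tmp;

            list_for_each_entry_safe(dep, tmp,
                                     &node->signalers_list, signal_link) {
                    /*
                     * If this request is being retired, it must have
                     * executed, so everything it waited on must have
                     * executed before it; anything else is a seqno
                     * ordering bug, hence the BUG_ON below.
                     */
                    GEM_BUG_ON(!node_signaled(dep->signaler));
                    list_del(&dep->signal_link);
            }
    }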
Comment 1 CI Bug Log 2019-09-11 11:51:03 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* TGL: igt@gem_tiled_fence_blits@normal - incomplete - GEM_BUG_ON(!node_signaled(dep->signaler))
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_364/fi-tgl-u/igt@gem_tiled_fence_blits@normal.html
Comment 2 Chris Wilson 2019-09-11 11:58:05 UTC
The observed seqno went backwards: the driver thinks we executed before one of our dma-fence triggers signaled. It seems to be a normal request flow (no hangs), which raises the spectre of incoherency, or perhaps we just managed to really upset the GPU?
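
(For context: completion on i915 is decided by comparing the seqno the GPU writes back against the request's own seqno, using wrap-safe signed arithmetic. A minimal sketch of that comparison, matching the i915_seqno_passed() helper in the driver:)

    /* Wrap-safe seqno comparison: true if seq1 is at or after seq2,
     * treating the difference as signed so wraparound is handled. */
    static inline bool i915_seqno_passed(u32 seq1, u32 seq2)
    {
            return (s32)(seq1 - seq2) >= 0;
    }

"Went backwards" here means the breadcrumb read back from memory claimed this request had completed while a dma-fence it depends on had not yet signaled.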
Comment 3 Chris Wilson 2019-09-13 15:37:41 UTC
commit ee73e2795b416b829f0e00e7c43154922dff495b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 12 14:23:13 2019 +0100

    drm/i915/tgl: Disable preemption while being debugged
    
    We see failures where the context continues executing past a
    preemption event, eventually leading to situations where a request
    has executed before we have even submitted it to HW! It seems like
    tgl is ignoring our RING_TAIL updates, but more likely there is a
    missing update required for our semaphore waits around preemption.
    
    v2: And disable internal semaphore usage
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190912132313.12751-1-chris@chris-wilson.co.uk
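
(The actual diff is behind the Link above; as a hedged illustration only, the shape of such a workaround is to stop advertising the relevant engine capabilities, so the scheduler never emits the preemption/semaphore sequences that were misbehaving. The flag names below follow the i915 engine flags; the function name and gen check are assumptions for illustration, not the patch itself:)

    /* Hypothetical sketch of the workaround's shape, not the actual
     * patch: do not mark gen12 engines as capable of semaphores or
     * preemption while the RING_TAIL/semaphore issue is debugged. */
    static void set_scheduler_caps(struct intel_engine_cs *engine)
    {
            if (IS_GEN(engine->i915, 12))
                    return; /* assumption: gated while being debugged */

            engine->flags |= I915_ENGINE_HAS_SEMAPHORES;
            engine->flags |= I915_ENGINE_HAS_PREEMPTION;
    }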
Comment 4 CI Bug Log 2019-09-24 08:41:16 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* TGL: igt@runner@aborted - fail - Previous test: kms_rmfb (close-fd)
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_375/fi-tgl-u/igt@runner@aborted.html
Comment 5 Martin Peres 2019-10-16 13:07:44 UTC
This happened twice in 9 runs, but that was before shards came in and started testing this more often than 5 times a week. So I am inclined to believe that this is "fixed".

Is that an acceptable workaround for users though?
Comment 6 CI Bug Log 2019-10-16 13:18:58 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.
