Bug 99733

Summary:	[ILK] [BAT] gem_exec_fence@await-hang-default hangs on CI
Product:	DRI
Component:	DRM/Intel
Status:	CLOSED FIXED
Severity:	normal
Priority:	medium
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:	ILK
i915 features:	GEM/Other
Reporter:	Jani Saarinen <jani.saarinen>
Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
CC:	intel-gfx-bugs

Description Jani Saarinen 2017-02-09 15:53:01 UTC

igt@gem_exec_fence@await-hang-default hangs on CI.

https://intel-gfx-ci.01.org/CI/fi-ilk-m540.html
=>
Dmesg: 
https://intel-gfx-ci.01.org/CI/CI_DRM_2177/fi-ilk-m540/dmesg-during.log

Comment 1 Chris Wilson 2017-02-09 15:59:00 UTC

BUG_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));

[  110.818123] [IGT] gem_exec_fence: starting subtest await-hang-default
[  120.858186] [drm:i915_reset_and_wakeup [i915]] resetting chip
[  120.858286] drm/i915: Resetting chip after gpu hang
[  120.858894] [drm:i915_gem_reset [i915]] context gem_exec_fence[6342]/0 marked guilty (score 10) banned? no
[  120.858915] [drm:i915_gem_reset [i915]] resetting render ring to restart from tail of request 0x1473a
[  120.861524] [drm:ironlake_enable_drps [i915]] fmax: 0, fmin: 10, fstart: 8
[  120.863213] [drm:intel_guc_setup [i915]] GuC fw status: path (null), fetch NONE, load NONE
[  126.874047] [drm:i915_reset_and_wakeup [i915]] resetting chip
[  126.874205] drm/i915: Resetting chip after gpu hang
[  126.874514] ------------[ cut here ]------------
[  126.874521] kernel BUG at ./include/linux/dma-fence.h:419!
[  126.874526] invalid opcode: 0000 [#1] PREEMPT SMP
[  126.874531] Modules linked in: intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel snd_hda_codec lpc_ich snd_hwdep snd_hda_core snd_pcm mei_me mei i915 sdhci_pci sdhci mmc_core e1000e ptp pps_core
[  126.874559] CPU: 1 PID: 31 Comm: kworker/1:1 Not tainted 4.10.0-rc7-CI-CI_DRM_2177+ #1
[  126.874565] Hardware name: Hewlett-Packard HP EliteBook 8440p/172A, BIOS 68CCU Ver. F.24 09/13/2013
[  126.874601] Workqueue: events_long i915_hangcheck_elapsed [i915]
[  126.874607] task: ffff8801329f4bc0 task.stack: ffffc9000015c000
[  126.874639] RIP: 0010:i915_gem_reset+0x3b7/0x3d0 [i915]
[  126.874645] RSP: 0018:ffffc9000015fb80 EFLAGS: 00010202
[  126.874650] RAX: 0000000000000003 RBX: ffff8801291c8008 RCX: ffff880126c07cb8
[  126.874657] RDX: ffff8801291c8008 RSI: 000000000001473a RDI: ffff8801291c8008
[  126.874663] RBP: ffffc9000015fbe0 R08: 0000000000012f4e R09: fc26e25000000000
[  126.874669] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880126c07420
[  126.874675] R13: ffff880126c04488 R14: ffff88012736b7f8 R15: ffff880125d2dec0
[  126.874681] FS:  0000000000000000(0000) GS:ffff880137c40000(0000) knlGS:0000000000000000
[  126.874691] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  126.874699] CR2: 00007ffed9abaf90 CR3: 0000000001e0f000 CR4: 00000000000006e0
[  126.874708] Call Trace:
[  126.874744]  ? ironlake_do_reset+0x94/0xa0 [i915]
[  126.874773]  i915_reset+0x12a/0x1c0 [i915]
[  126.874803]  i915_reset_and_wakeup+0xf7/0x150 [i915]
[  126.874833]  i915_handle_error+0x19b/0x210 [i915]
[  126.874846]  ? scnprintf+0x3d/0x70
[  126.874880]  hangcheck_declare_hang+0xc6/0xf0 [i915]
[  126.874916]  ? intel_engine_get_active_head+0x56/0xd0 [i915]
[  126.874952]  i915_hangcheck_elapsed+0x29a/0x2d0 [i915]
[  126.874966]  process_one_work+0x1f4/0x6d0
[  126.874974]  ? process_one_work+0x16e/0x6d0
[  126.874982]  worker_thread+0x49/0x4a0
[  126.874990]  kthread+0x107/0x140
[  126.874998]  ? process_one_work+0x6d0/0x6d0
[  126.875005]  ? kthread_create_on_node+0x40/0x40
[  126.875017]  ret_from_fork+0x2e/0x40
[  126.875024] Code: 38 ae 1b a0 e8 3b 12 fa e0 e9 88 fc ff ff 4c 89 e7 4c 89 45 a0 e8 ca de ff ff 48 8b 45 c0 4c 8b 45 a0 48 8b 70 30 e9 ef fe ff ff <0f> 0b 49 8b 06 48 05 f8 57 00 00 e9 6a fe ff ff 66 0f 1f 84 00
[  126.875107] RIP: i915_gem_reset+0x3b7/0x3d0 [i915] RSP: ffffc9000015fb80

Comment 2 Chris Wilson 2017-02-09 16:00:06 UTC

Looks like an intriguing race.

Comment 3 Chris Wilson 2017-02-10 21:13:25 UTC

commit 8c12d121590ebe5a43bf9a0aedbbeb192f257846
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 10 18:52:14 2017 +0000

    drm/i915: Move the irq_barrier for reset earlier into reset_prepare

This popped out when looking at that sequence and thinking "Ironlake". I don't think it explains the BUG_ON, but it should definitely affect the timing that led up to it.

Comment 4 Chris Wilson 2017-02-13 11:22:35 UTC

commit fe3288b5da2c1286a7aac1fb1b2234caa752a81b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Feb 12 17:20:01 2017 +0000

    drm/i915: Park the breadcrumbs signaler across a GPU reset
    
    The signal threads may be running concurrently with the GPU reset. The
    completion from the GPU run asynchronous with the reset and two threads
    may see different snapshots of the state, and the signaler may mark a
    request as complete as we try to reset it. We don't tolerate 2 different
    views of the same state and complain if we try to mark a request as
    failed if it is already complete. Disable the signal threads during
    reset to prevent this conflict (even though the conflict implies that
    the state we resetting to is invalid, we have already made our
    decision!).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99733
    References: https://bugs.freedesktop.org/show_bug.cgi?id=99671
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20170212172002.23072-4-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
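
For readers following along, the race described in the commit message above can be illustrated with a small user-space model. This is only a sketch and not the i915 code: `struct request`, `struct signaler`, `signaler_step()`, `request_set_error()` and `simulate_reset()` are hypothetical names invented for the illustration, the interleaving is fixed by hand instead of using real threads, and the "error on an already-signaled request" check is modelled as a `false` return rather than a BUG_ON.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical error value for the demo (stands in for a negative errno). */
#define DEMO_EIO 5

/* Hypothetical, simplified stand-ins; these are NOT i915 structures. */
struct request {
    bool signaled;  /* set once the request has been marked complete */
    int error;      /* 0, or a negative error set by the reset path */
};

struct signaler {
    bool parked;    /* while parked, the signaler must not touch requests */
};

/* One step of the signaler: mark the request complete unless parked.
 * Returns true if this call signaled the request. */
static bool signaler_step(struct signaler *s, struct request *rq)
{
    if (s->parked || rq->signaled)
        return false;
    rq->signaled = true;
    return true;
}

/* Models the check that fired in the log: an error may only be set on a
 * request that has not signaled yet. Returns false where the kernel
 * would have hit the BUG_ON. */
static bool request_set_error(struct request *rq, int error)
{
    if (rq->signaled)
        return false;
    rq->error = error;
    return true;
}

/* Hand-scheduled interleaving of "reset" and "signaler":
 *   1. reset snapshots the request state (still incomplete),
 *   2. the signaler gets a chance to run,
 *   3. reset marks the request as failed based on its snapshot.
 * With park_signaler == true, step 2 becomes a no-op.
 * Returns true if the reset finished without a conflict. */
bool simulate_reset(bool park_signaler)
{
    struct request rq = { .signaled = false, .error = 0 };
    struct signaler sig = { .parked = false };
    bool ok = true;

    if (park_signaler)
        sig.parked = true;              /* park before looking at state */

    bool guilty = !rq.signaled;         /* step 1: reset's snapshot */

    signaler_step(&sig, &rq);           /* step 2: racing completion */

    if (guilty)                         /* step 3: act on the snapshot */
        ok = request_set_error(&rq, -DEMO_EIO);

    sig.parked = false;                 /* unpark once reset is done */
    return ok;
}
```

Called from a `main()`, `simulate_reset(false)` returns false because the signaler marks the request complete between the snapshot and the error being set, which is the conflicting pair of views the commit message describes. `simulate_reset(true)` returns true because the parked signaler cannot change the request while the reset acts on its snapshot.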

Comment 5 Jani Saarinen 2017-02-13 12:06:48 UTC

Patch landed. Let's follow the situation and close if it is not seen anymore.

Comment 6 Jani Saarinen 2017-02-16 14:46:41 UTC

Verifying.

Comment 7 Jani Saarinen 2017-02-16 14:46:53 UTC

Closing
