Summary: | [ILK] [BAT] gem_exec_fence@await-hang-default hangs on CI | ||
---|---|---|---|
Product: | DRI | Reporter: | Jani Saarinen <jani.saarinen> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs |
Version: | DRI git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | ILK | i915 features: | GEM/Other |
Description
Jani Saarinen
2017-02-09 15:53:01 UTC
BUG_ON(test_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags));

[ 110.818123] [IGT] gem_exec_fence: starting subtest await-hang-default
[ 120.858186] [drm:i915_reset_and_wakeup [i915]] resetting chip
[ 120.858286] drm/i915: Resetting chip after gpu hang
[ 120.858894] [drm:i915_gem_reset [i915]] context gem_exec_fence[6342]/0 marked guilty (score 10) banned? no
[ 120.858915] [drm:i915_gem_reset [i915]] resetting render ring to restart from tail of request 0x1473a
[ 120.861524] [drm:ironlake_enable_drps [i915]] fmax: 0, fmin: 10, fstart: 8
[ 120.863213] [drm:intel_guc_setup [i915]] GuC fw status: path (null), fetch NONE, load NONE
[ 126.874047] [drm:i915_reset_and_wakeup [i915]] resetting chip
[ 126.874205] drm/i915: Resetting chip after gpu hang
[ 126.874514] ------------[ cut here ]------------
[ 126.874521] kernel BUG at ./include/linux/dma-fence.h:419!
[ 126.874526] invalid opcode: 0000 [#1] PREEMPT SMP
[ 126.874531] Modules linked in: intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel snd_hda_codec lpc_ich snd_hwdep snd_hda_core snd_pcm mei_me mei i915 sdhci_pci sdhci mmc_core e1000e ptp pps_core
[ 126.874559] CPU: 1 PID: 31 Comm: kworker/1:1 Not tainted 4.10.0-rc7-CI-CI_DRM_2177+ #1
[ 126.874565] Hardware name: Hewlett-Packard HP EliteBook 8440p/172A, BIOS 68CCU Ver. F.24 09/13/2013
[ 126.874601] Workqueue: events_long i915_hangcheck_elapsed [i915]
[ 126.874607] task: ffff8801329f4bc0 task.stack: ffffc9000015c000
[ 126.874639] RIP: 0010:i915_gem_reset+0x3b7/0x3d0 [i915]
[ 126.874645] RSP: 0018:ffffc9000015fb80 EFLAGS: 00010202
[ 126.874650] RAX: 0000000000000003 RBX: ffff8801291c8008 RCX: ffff880126c07cb8
[ 126.874657] RDX: ffff8801291c8008 RSI: 000000000001473a RDI: ffff8801291c8008
[ 126.874663] RBP: ffffc9000015fbe0 R08: 0000000000012f4e R09: fc26e25000000000
[ 126.874669] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880126c07420
[ 126.874675] R13: ffff880126c04488 R14: ffff88012736b7f8 R15: ffff880125d2dec0
[ 126.874681] FS: 0000000000000000(0000) GS:ffff880137c40000(0000) knlGS:0000000000000000
[ 126.874691] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 126.874699] CR2: 00007ffed9abaf90 CR3: 0000000001e0f000 CR4: 00000000000006e0
[ 126.874708] Call Trace:
[ 126.874744] ? ironlake_do_reset+0x94/0xa0 [i915]
[ 126.874773] i915_reset+0x12a/0x1c0 [i915]
[ 126.874803] i915_reset_and_wakeup+0xf7/0x150 [i915]
[ 126.874833] i915_handle_error+0x19b/0x210 [i915]
[ 126.874846] ? scnprintf+0x3d/0x70
[ 126.874880] hangcheck_declare_hang+0xc6/0xf0 [i915]
[ 126.874916] ? intel_engine_get_active_head+0x56/0xd0 [i915]
[ 126.874952] i915_hangcheck_elapsed+0x29a/0x2d0 [i915]
[ 126.874966] process_one_work+0x1f4/0x6d0
[ 126.874974] ? process_one_work+0x16e/0x6d0
[ 126.874982] worker_thread+0x49/0x4a0
[ 126.874990] kthread+0x107/0x140
[ 126.874998] ? process_one_work+0x6d0/0x6d0
[ 126.875005] ? kthread_create_on_node+0x40/0x40
[ 126.875017] ret_from_fork+0x2e/0x40
[ 126.875024] Code: 38 ae 1b a0 e8 3b 12 fa e0 e9 88 fc ff ff 4c 89 e7 4c 89 45 a0 e8 ca de ff ff 48 8b 45 c0 4c 8b 45 a0 48 8b 70 30 e9 ef fe ff ff <0f> 0b 49 8b 06 48 05 f8 57 00 00 e9 6a fe ff ff 66 0f 1f 84 00
[ 126.875107] RIP: i915_gem_reset+0x3b7/0x3d0 [i915] RSP: ffffc9000015fb80

Looks like an intriguing race.
commit 8c12d121590ebe5a43bf9a0aedbbeb192f257846
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Feb 10 18:52:14 2017 +0000

drm/i915: Move the irq_barrier for reset earlier into reset_prepare

Popped out when looking at that sequence and thinking "Ironlake". I don't think it explains the BUG_ON, but it should definitely affect the timing that led up to it.

commit fe3288b5da2c1286a7aac1fb1b2234caa752a81b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Feb 12 17:20:01 2017 +0000

drm/i915: Park the breadcrumbs signaler across a GPU reset

The signal threads may be running concurrently with the GPU reset. The completion from the GPU run asynchronous with the reset and two threads may see different snapshots of the state, and the signaler may mark a request as complete as we try to reset it. We don't tolerate 2 different views of the same state and complain if we try to mark a request as failed if it is already complete. Disable the signal threads during reset to prevent this conflict (even though the conflict implies that the state we resetting to is invalid, we have already made our decision!).

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99733
References: https://bugs.freedesktop.org/show_bug.cgi?id=99671
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Mika Kuoppala <mika.kuoppala@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: http://patchwork.freedesktop.org/patch/msgid/20170212172002.23072-4-chris@chris-wilson.co.uk
Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>

Patch landed. Let's follow the situation and close if it is not seen anymore.

Verifying.

Closing
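
The race described in the second commit message can be sketched as a small, single-threaded model that replays the problematic interleaving deterministically. This is only an illustration under stated assumptions, not the i915 code: `model_request`, `model_signaler`, `signaler_complete`, `reset_mark_failed` and `run_reset` are all hypothetical names, the `signaled` flag merely stands in for `DMA_FENCE_FLAG_SIGNALED_BIT`, and the error value is an arbitrary placeholder.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model; none of these names are i915 API. */
struct model_request {
	bool signaled; /* stands in for DMA_FENCE_FLAG_SIGNALED_BIT */
	int error;     /* 0, or an illustrative negative error code */
};

struct model_signaler {
	bool parked;   /* true while a reset is in progress */
};

/* Signaler path: mark a request complete unless the signaler is parked. */
static bool signaler_complete(const struct model_signaler *s,
			      struct model_request *rq)
{
	assert(s && rq);
	if (s->parked)
		return false;
	rq->signaled = true;
	return true;
}

/*
 * Reset path: mark a request as failed. Returns false in the situation
 * where the kernel would have hit the BUG_ON (request already signaled).
 */
static bool reset_mark_failed(struct model_request *rq)
{
	if (rq->signaled)
		return false;
	rq->error = -5; /* placeholder error code, an assumption */
	return true;
}

/*
 * Replay the interleaving: the reset has already decided the request is
 * guilty, then the signaler gets a chance to run before the request is
 * marked as failed. Returns true if the reset completes without conflict.
 */
bool run_reset(bool park_signaler)
{
	struct model_signaler s = { .parked = false };
	struct model_request rq = { .signaled = false, .error = 0 };
	bool ok;

	if (park_signaler)
		s.parked = true;        /* the fix: park before touching rq */

	signaler_complete(&s, &rq);     /* signaler runs inside the window */
	ok = reset_mark_failed(&rq);

	s.parked = false;               /* unpark once the reset is done */
	return ok;
}
```

In this model `run_reset(false)` reports a conflict because the signaler completes the request inside the reset window, while `run_reset(true)` does not, since the parked signaler leaves the request untouched until the reset has finished.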