Bug 100144 - [ILK][BAT] gem_exec_fence@await-hang-default.html
Summary: [ILK][BAT] gem_exec_fence@await-hang-default.html
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2017-03-10 07:57 UTC by Jani Saarinen
Modified: 2017-07-24 23:15 UTC

i915 platform: ILK
i915 features: GEM/execlists


Comment 1 Chris Wilson 2017-03-10 08:19:45 UTC
(In reply to Jani Saarinen from comment #0)
> Maybe could have also opened old
> https://bugs.freedesktop.org/show_bug.cgi?id=99736?

No, because it is not the same bug. Waiting on CI providing complete debug logs.
Comment 2 Chris Wilson 2017-03-16 14:41:46 UTC
Without capturing the oops here there is little progress we can make.
Comment 3 Chris Wilson 2017-03-17 09:06:45 UTC
[  209.126253] [IGT] gem_exec_fence: starting subtest await-hang-default
[  212.698224] [drm:missed_breadcrumb [i915]] render ring missed breadcrumb at intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no
[  218.719419] [drm] GPU HANG: ecode 5:0:0xe77fff7f, in gem_exec_fence [6613], reason: Hang on render ring, action: reset
[  218.719698] [drm:i915_reset_and_wakeup [i915]] resetting chip
[  218.719770] drm/i915: Resetting chip after gpu hang
[  218.720366] [drm:i915_gem_reset [i915]] context gem_exec_fence[6613]/0 marked guilty (score 10) banned? no
[  218.720388] [drm:i915_gem_reset [i915]] resetting render ring to restart from tail of request 0x11884
[  218.720506] [drm:init_workarounds_ring [i915]] render ring: Number of context specific w/a: 0
[  220.697838] [drm:missed_breadcrumb [i915]] bsd ring missed breadcrumb at intel_breadcrumbs_hangcheck+0x5c/0x80 [i915], irq posted? no
[  225.867366] [IGT] gem_exec_fence: exiting, ret=0
[  225.868419] ------------[ cut here ]------------
[  225.868427] kernel BUG at drivers/gpu/drm/i915/i915_gem_request.c:413!
[  225.868434] invalid opcode: 0000 [#1] PREEMPT SMP
[  225.868440] Modules linked in: intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec_hdmi ghash_clmulni_intel snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm lpc_ich mei_me mei i915 e1000e ptp pps_core prime_numbers
[  225.868478] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G        W       4.11.0-rc2-CI-Trybot_650+ #1
[  225.868487] Hardware name: Hewlett-Packard HP Compaq 8100 Elite SFF PC/304Ah, BIOS 786H1 v01.13 07/14/2011
[  225.868523] Workqueue: events i915_clflush_work [i915]
[  225.868529] task: ffff880212190040 task.stack: ffffc90000138000
[  225.868554] RIP: 0010:__i915_gem_request_submit+0x19d/0x1a0 [i915]
[  225.868560] RSP: 0018:ffffc9000013bc90 EFLAGS: 00010016
[  225.868566] RAX: 0000000000010ccc RBX: ffff88020cc2f9c0 RCX: 0000000000000001
[  225.868573] RDX: 00000000ffffffff RSI: 00000000ffffffff RDI: ffff880206f67b28
[  225.868580] RBP: ffffc9000013bcb8 R08: 000000005640d526 R09: b90709f000000000
[  225.868586] R10: ffffffff82782218 R11: ffff880212190040 R12: 0000000000004b3f
[  225.868593] R13: ffff880206f67b00 R14: ffff880206e8a158 R15: ffff88020888c748
[  225.868600] FS:  0000000000000000(0000) GS:ffff88021bcc0000(0000) knlGS:0000000000000000
[  225.868609] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  225.868616] CR2: 00000000009ba888 CR3: 0000000207414000 CR4: 00000000000006e0
[  225.868624] Call Trace:
[  225.868648]  i915_gem_request_submit+0x2b/0x50 [i915]
[  225.868675]  i9xx_submit_request+0x16/0x50 [i915]
[  225.868700]  submit_notify+0x3f/0x5c [i915]
[  225.868720]  __i915_sw_fence_complete+0x176/0x220 [i915]
[  225.868730]  ? try_to_del_timer_sync+0x4d/0x60
[  225.868751]  i915_sw_fence_complete+0x25/0x40 [i915]
[  225.868772]  dma_i915_sw_fence_wake+0x26/0x60 [i915]
[  225.868781]  dma_fence_signal+0x146/0x230
[  225.868804]  i915_clflush_work+0x3d/0x140 [i915]
[  225.868826]  ? i915_clflush_work+0x61/0x140 [i915] 
[  225.868836]  process_one_work+0x1f4/0x6d0
[  225.868843]  ? process_one_work+0x16e/0x6d0
[  225.868850]  worker_thread+0x49/0x4a0
[  225.868857]  kthread+0x107/0x140
[  225.868863]  ? process_one_work+0x6d0/0x6d0
[  225.868870]  ? kthread_create_on_node+0x40/0x40
[  225.868878]  ret_from_fork+0x2e/0x40
[  225.868884] Code: 18 a0 be 93 01 00 00 48 c7 c7 20 f2 19 a0 e8 8b 3b fc e0 e9 a6 fe ff ff 48 89 df e8 ce 16 01 00 e9 f9 fe ff ff 0f 0b 0f 0b 0f 0b <0f> 0b 90 55 48 89 e5 41 55 41 54 53 48 8b 9f a8 00 00 00 49 89 
[  225.868961] RIP: __i915_gem_request_submit+0x19d/0x1a0 [i915] RSP: ffffc9000013bc90
[  225.868972] ---[ end trace 991ea876a8672e61 ]---
[  225.868979] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:33
[  225.868990] in_atomic(): 1, irqs_disabled(): 1, pid: 30, name: kworker/3:0
[  225.868998] INFO: lockdep is turned off.
[  225.869003] irq event stamp: 1192454
[  225.869010] hardirqs last  enabled at (1192453): [<ffffffff81876587>] _raw_spin_unlock_irq+0x27/0x50
[  225.869021] hardirqs last disabled at (1192454): [<ffffffff81876377>] _raw_spin_lock_irqsave+0x17/0x60
[  225.869034] softirqs last  enabled at (1192118): [<ffffffff81085cd9>] __do_softirq+0x1d9/0x4c0
[  225.869044] softirqs last disabled at (1192097): [<ffffffff81086139>] irq_exit+0xa9/0xc0
[  225.869054] Preemption disabled at:
[  225.869056] [<ffffffff815ec000>] dma_fence_signal+0x100/0x230
[  225.869069] CPU: 3 PID: 30 Comm: kworker/3:0 Tainted: G      D W       4.11.0-rc2-CI-Trybot_650+ #1
[  225.869080] Hardware name: Hewlett-Packard HP Compaq 8100 Elite SFF PC/304Ah, BIOS 786H1 v01.13 07/14/2011
[  225.869107] Workqueue: events i915_clflush_work [i915]
[  225.869114] Call Trace:
[  225.869122]  dump_stack+0x67/0x92 
[  225.869129]  ___might_sleep+0x162/0x250
[  225.869135]  __might_sleep+0x45/0x80
[  225.869143]  exit_signals+0x1f/0x2a0
[  225.869150]  do_exit+0x9f/0xcd0
[  225.869156]  ? kthread+0x107/0x140
[  225.869162]  ? process_one_work+0x6d0/0x6d0
[  225.869169]  ? kthread_create_on_node+0x40/0x40
[  225.869176]  rewind_stack_do_exit+0x17/0x20
[  225.869185] note: kworker/3:0[30] exited with preempt_count 2

That's pretty unbelievable and scary.
Comment 4 Jani Saarinen 2017-04-05 12:15:24 UTC
Statistics: Failure rate 1/165 run(s) (about 0.6%)

From Chris: it actually matches the bug I've been looking at for the last 2 weeks.
Keeping this open for now.
Comment 5 Jani Saarinen 2017-04-05 15:40:19 UTC
Patch on review: https://patchwork.freedesktop.org/series/22527/
Comment 6 Chris Wilson 2017-04-07 09:23:23 UTC
The symptoms are close enough that I'm treating

commit cbb60b4b987c8a57533dca0f66887ed14a9498e5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Apr 6 18:00:28 2017 +0100

    drm/i915: Advance ring->head fully when idle

as the likely fix (even though a seqno reset during that hang is actually unlikely). The bug cbb60b4b98 fixes is real enough.
Comment 7 Jani Saarinen 2017-04-20 14:37:11 UTC
Re-opening, as this was seen again on a patchwork run:
https://patchwork.freedesktop.org/series/23227/
and comment from Chris on IRC: "fdo finally replied, yup it has the same impossible bug-on as 100144"
Comment 8 Chris Wilson 2017-04-25 14:45:23 UTC
commit e6ba9992de6c63fe86c028b4876338e1cb7dac34
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 25 14:00:49 2017 +0100

    drm/i915: Differentiate between sw write location into ring and last hw read
    
    We need to keep track of the last location we ask the hw to read up to
    (RING_TAIL) separately from our last write location into the ring, so
    that in the event of a GPU reset we do not tell the HW to proceed into
    a partially written request (which can happen if that request is waiting
    for an external signal before being executed).
    
    v2: Refactor intel_ring_reset() (Mika)
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100144
    Testcase: igt/gem_exec_fence/await-hang
    Fixes: 821ed7df6e2a ("drm/i915: Update reset path to fix incomplete requests")
    Fixes: d55ac5bf97c6 ("drm/i915: Defer transfer onto execution timeline to actual hw submission")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20170425130049.26147-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
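
For readers following the commit message above, here is a minimal toy sketch of the idea it describes: keep the software write offset separate from the last offset the hardware was told to read up to. This is not the i915 code. The struct, field and function names (toy_ring, emit, tail, toy_ring_write and so on) are hypothetical and chosen only for illustration; the real driver structures and reset path are more involved.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not the i915 structures. */
struct toy_ring {
    uint32_t size; /* ring size in bytes, assumed to be a power of two */
    uint32_t emit; /* offset where software will write the next command */
    uint32_t tail; /* last offset the hardware was told to read up to */
};

/* Software writes 'bytes' of commands; the hardware is not told yet. */
uint32_t toy_ring_write(struct toy_ring *ring, uint32_t bytes)
{
    ring->emit = (ring->emit + bytes) & (ring->size - 1);
    return ring->emit;
}

/* Submission: only now does the request become visible to the hardware. */
void toy_ring_submit(struct toy_ring *ring, uint32_t request_end)
{
    ring->tail = request_end;
}

/* After a reset, restart with the submitted tail, never the write offset. */
uint32_t toy_ring_tail_after_reset(const struct toy_ring *ring)
{
    return ring->tail;
}

/* Returns 0 if the restart tail excludes the written-but-unsubmitted request. */
int toy_ring_demo(void)
{
    struct toy_ring ring = { .size = 4096, .emit = 0, .tail = 0 };
    uint32_t first = toy_ring_write(&ring, 64);

    toy_ring_submit(&ring, first);

    /* Second request is written but still waits on an external fence. */
    toy_ring_write(&ring, 64);
    assert(ring.emit == 128);

    return toy_ring_tail_after_reset(&ring) == first ? 0 : 1;
}
```

In this sketch, a reset path that reused emit as the restart tail would point the hardware at the second, not yet submitted request, which is the situation the commit message says must be avoided.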
Comment 9 Jani Saarinen 2017-04-25 19:02:33 UTC
Thanks for the fix; whitelisting the test in CI again.

