Bug 105341

Summary: [CI] igt@gem_eio@in-flight-external - incomplete - i915_gem_find_active_request:2880 GEM_BUG_ON((__builtin_constant_p((DMA_FENCE_FLAG_SIGNALED_BIT)) ? constant_test_bit((DMA_FENCE_FLAG_SIGNALED_BIT), (&request->fence.flags)) : variable_test_bit((DMA_FENCE
Product: DRI Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/IntelAssignee: Marta Löfstedt <marta.lofstedt>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=104945
https://bugs.freedesktop.org/show_bug.cgi?id=105358
Whiteboard: ReadyForDev
i915 platform: HSW i915 features: GEM/Other

Description Marta Löfstedt 2018-03-05 06:40:09 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4324/shard-hsw3/igt@gem_eio@in-flight-external.html

From pstore:
<7>[ 1145.730073] [drm:i915_reset_device [i915]] resetting chip
<5>[ 1145.730114] i915 0000:00:02.0: Resetting chip after gpu hang
<3>[ 1145.730154] i915_gem_find_active_request:2880 GEM_BUG_ON((__builtin_constant_p((DMA_FENCE_FLAG_SIGNALED_BIT)) ? constant_test_bit((DMA_FENCE_FLAG_SIGNALED_BIT), (&request->fence.flags)) : variable_test_bit((DMA_FENCE_FLAG_SIGNALED_BIT), (&request->fence.flags))))
<4>[ 1145.730252] ------------[ cut here ]------------
<2>[ 1145.730253] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:2880!
<4>[ 1145.730268] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<0>[ 1145.730272] Dumping ftrace buffer:
<0>[ 1145.730276] ---------------------------------
<0>[ 1145.730304]    <...>-6615    5d..1 1145830701us : i915_gem_find_active_request: i915_gem_find_active_request:2880 GEM_BUG_ON((__builtin_constant_p((DMA_FENCE_FLAG_SIGNALED_BIT)) ? constant_test_bit((DMA_FENCE_FLAG_SIGNALED_BIT), (&request->fence.flags)) : variable_test_bit((DMA_FENCE_FLAG_SIGNALED_BIT), (&request->fence.flags))))
<0>[ 1145.730319] ---------------------------------
<4>[ 1145.730322] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul snd_hda_codec_realtek crc32_pclmul snd_hda_codec_generic ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm mii mei_me lpc_ich mei prime_numbers [last unloaded: i915]
<4>[ 1145.730348] CPU: 5 PID: 6615 Comm: gem_eio Tainted: G     U  W        4.16.0-rc3-CI-CI_DRM_3868+ #1
<4>[ 1145.730353] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4>[ 1145.730376] RIP: 0010:i915_gem_find_active_request+0x19b/0x1b0 [i915]
<4>[ 1145.730381] RSP: 0018:ffffc90001ab7b30 EFLAGS: 00010086
<4>[ 1145.730384] RAX: 000000000000003d RBX: ffff880407ed8008 RCX: 000000000000011c
<4>[ 1145.730389] RDX: 0000000000000008 RSI: ffff88040dbd5088 RDI: ffff88040dbd4fb8
<4>[ 1145.730393] RBP: ffff88035c0970c0 R08: ffff88040d6c3000 R09: 0000000000000001
<4>[ 1145.730397] R10: 0000000000000001 R11: 0000000000000005 R12: ffff880401897228
<4>[ 1145.730401] R13: ffff880401890000 R14: 0000000000000000 R15: ffff88040189a600
<4>[ 1145.730406] FS:  00007fd772e3f8c0(0000) GS:ffff88041fb40000(0000) knlGS:0000000000000000
<4>[ 1145.730410] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 1145.730414] CR2: 00007f40d26b1888 CR3: 00000003f8c44004 CR4: 00000000001606e0
<4>[ 1145.730418] Call Trace:
<4>[ 1145.730438]  i915_gem_reset_prepare_engine+0x6e/0x90 [i915]
<4>[ 1145.730459]  i915_gem_reset_prepare+0x2c/0x60 [i915]
<4>[ 1145.730476]  i915_reset+0x79/0x2a0 [i915]
<4>[ 1145.730493]  i915_reset_device+0x1cb/0x230 [i915]
<4>[ 1145.730511]  ? __intel_get_crtc_scanline+0x1f0/0x1f0 [i915]
<4>[ 1145.730517]  ? work_on_cpu_safe+0x50/0x50
<4>[ 1145.730536]  i915_handle_error+0x2d3/0x430 [i915]
<4>[ 1145.730545]  ? __might_fault+0x39/0x90
<4>[ 1145.730550]  ? __might_fault+0x39/0x90
<4>[ 1145.730578]  i915_wedged_set+0x79/0xc0 [i915]
<4>[ 1145.730585]  simple_attr_write+0xab/0xc0
<4>[ 1145.730591]  full_proxy_write+0x4b/0x70
<4>[ 1145.730598]  __vfs_write+0x1e/0x130
<4>[ 1145.730603]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[ 1145.730609]  ? rcu_sync_lockdep_assert+0x25/0x50
<4>[ 1145.730615]  ? __sb_start_write+0xd9/0x1f0
<4>[ 1145.730620]  ? __sb_start_write+0xf3/0x1f0
<4>[ 1145.730626]  vfs_write+0xbd/0x1b0
<4>[ 1145.730631]  SyS_write+0x40/0xa0
<4>[ 1145.730636]  do_syscall_64+0x65/0x1a0
<4>[ 1145.730642]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
<4>[ 1145.730648] RIP: 0033:0x7fd771625670
<4>[ 1145.730653] RSP: 002b:00007ffe08a26d28 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4>[ 1145.730661] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd771625670
<4>[ 1145.730668] RDX: 0000000000000002 RSI: 0000560dcb2c354e RDI: 0000000000000006
<4>[ 1145.730675] RBP: 0000000000000002 R08: 0000560dcc88c390 R09: 0000000000000000
<4>[ 1145.730681] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000006
<4>[ 1145.730688] R13: 0000560dcb2c354e R14: 0000000000000000 R15: 0000000000000000
<4>[ 1145.730697] Code: e8 e0 4a c8 e0 48 8b 35 64 97 1b 00 49 c7 c0 48 62 58 a0 b9 40 0b 00 00 48 c7 c2 d0 79 55 a0 48 c7 c7 76 87 46 a0 e8 75 71 ce e0 <0f> 0b 31 ed e9 cf fe ff ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 
<1>[ 1145.730776] RIP: i915_gem_find_active_request+0x19b/0x1b0 [i915] RSP: ffffc90001ab7b30
<4>[ 1145.730785] ---[ end trace 90ca7f022a777767 ]---
<4>[ 1145.770287] hpet1: lost 1 rtc interrupts
<4>[ 1145.806186] hpet1: lost 2 rtc interrupts
<4>[ 1145.837626] hpet1: lost 1 rtc interrupts
<4>[ 1145.888597] hpet1: lost 1 rtc interrupts
<4>[ 1145.916406] hpet1: lost 1 rtc interrupts
<3>[ 1146.250906] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
<3>[ 1146.250915] in_atomic(): 1, irqs_disabled(): 1, pid: 6615, name: gem_eio
<4>[ 1146.250932] INFO: lockdep is turned off.
<4>[ 1146.250936] irq event stamp: 28900
<4>[ 1146.250942] hardirqs last  enabled at (28899): [<0000000013747b2c>] _raw_spin_unlock_irq+0x24/0x50
<4>[ 1146.250952] hardirqs last disabled at (28900): [<00000000117eda70>] _raw_spin_lock_irqsave+0xd/0x50
<4>[ 1146.250962] softirqs last  enabled at (28836): [<000000008c33a24f>] peernet2id+0x3c/0x50
<4>[ 1146.250970] softirqs last disabled at (28834): [<00000000a80db8cc>] peernet2id+0x1d/0x50
<3>[ 1146.250977] Preemption disabled at:
<4>[ 1146.250978] [<0000000000000000>]           (null)
<4>[ 1146.250988] CPU: 5 PID: 6615 Comm: gem_eio Tainted: G     UD W        4.16.0-rc3-CI-CI_DRM_3868+ #1
<4>[ 1146.250996] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4>[ 1146.251003] Call Trace:
<4>[ 1146.251009]  dump_stack+0x5f/0x86
<4>[ 1146.251015]  ___might_sleep+0x1d9/0x240
<4>[ 1146.251022]  exit_signals+0x1b/0x2a0
<4>[ 1146.251028]  do_exit+0x93/0xcb0
<4>[ 1146.251032]  ? SyS_write+0x40/0xa0
<4>[ 1146.251038]  rewind_stack_do_exit+0x17/0x20
<6>[ 1146.251098] note: gem_eio[6615] exited with preempt_count 1
<4>[ 1146.258017] hpet1: lost 21 rtc interrupts
<1>[ 1150.954269] BUG: unable to handle kernel paging request at 0000000000021748
<1>[ 1150.954296] IP: expire_timers+0x6d/0x1e0
<6>[ 1150.954305] PGD 0 P4D 0 
<4>[ 1150.954315] Oops: 0002 [#2] PREEMPT SMP PTI
<0>[ 1150.954327] Dumping ftrace buffer:
<0>[ 1150.954339]    (ftrace buffer empty)
<4>[ 1150.954348] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_hdmi x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul snd_hda_codec_realtek crc32_pclmul snd_hda_codec_generic ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm mii mei_me lpc_ich mei prime_numbers [last unloaded: i915]
<4>[ 1150.954426] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G     UD W        4.16.0-rc3-CI-CI_DRM_3868+ #1
<4>[ 1150.954442] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4>[ 1150.954458] RIP: 0010:expire_timers+0x6d/0x1e0
<4>[ 1150.954468] RSP: 0018:ffff88041fb43ed8 EFLAGS: 00010002
<4>[ 1150.954480] RAX: 0000000000021740 RBX: ffffc90001ab7c18 RCX: 0000000000000103
<4>[ 1150.954493] RDX: ffff88041fb43f10 RSI: 0000000000000001 RDI: 00000000ffffffff
<4>[ 1150.954506] RBP: ffff88041fb59980 R08: 0000000000000000 R09: 0000000000000000
<4>[ 1150.954519] R10: ffff88041fb43ed8 R11: ffffffff814635ed R12: ffff88041fb59980
<4>[ 1150.954532] R13: ffff88041fb43f10 R14: ffffffff823400f0 R15: dead000000000200
<4>[ 1150.954545] FS:  0000000000000000(0000) GS:ffff88041fb40000(0000) knlGS:0000000000000000
<4>[ 1150.954560] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[ 1150.954572] CR2: 0000000000021748 CR3: 0000000005210003 CR4: 00000000001606e0
<4>[ 1150.954584] Call Trace:
<4>[ 1150.954592]  <IRQ>
<4>[ 1150.954602]  run_timer_softirq+0x75/0x140
<4>[ 1150.954613]  ? ktime_get+0x5a/0x100
<4>[ 1150.954621]  ? ktime_get+0x84/0x100
<4>[ 1150.954633]  __do_softirq+0xc1/0x4aa
<4>[ 1150.954648]  irq_exit+0xa4/0xb0
<4>[ 1150.954656]  smp_apic_timer_interrupt+0x90/0x2d0
<4>[ 1150.954667]  apic_timer_interrupt+0x84/0x90
<4>[ 1150.954677]  </IRQ>
<4>[ 1150.954686] RIP: 0010:cpuidle_enter_state+0xaa/0x350
<4>[ 1150.954696] RSP: 0018:ffffc900000b7eb8 EFLAGS: 00000216 ORIG_RAX: ffffffffffffff12
<4>[ 1150.954712] RAX: ffff88041fb40000 RBX: 0000000005b2639a RCX: 0000000000000000
<4>[ 1150.954725] RDX: 0000010bfa382b40 RSI: ffffffff820b24f7 RDI: ffffffff8171f8e6
<4>[ 1150.954738] RBP: 0000000000000005 R08: 0000000000018658 R09: 000000000000000c
<4>[ 1150.954751] R10: 00000000000120dd R11: 0000000000018658 R12: 0000000000000005
<4>[ 1150.954764] R13: ffffe8ffffd49090 R14: 0000010bf485c7a6 R15: ffffffff822944a0
<4>[ 1150.954781]  ? cpuidle_enter_state+0xa6/0x350
<4>[ 1150.954793]  ? cpuidle_enter_state+0xa6/0x350
<4>[ 1150.954806]  do_idle+0x188/0x1d0
<4>[ 1150.954816]  cpu_startup_entry+0x14/0x20
<4>[ 1150.954828]  start_secondary+0x129/0x160
<4>[ 1150.954840]  secondary_startup_64+0xa5/0xb0
<4>[ 1150.954854] Code: 00 00 49 c7 c6 f0 00 34 82 65 8b 05 ae a1 f0 7e 89 c0 49 0f a3 06 0f 82 f0 00 00 00 48 8b 03 48 8b 53 08 48 85 c0 48 89 02 74 04 <48> 89 50 08 f6 43 22 20 48 c7 43 08 00 00 00 00 4c 89 e7 4c 89 
<1>[ 1150.954962] RIP: expire_timers+0x6d/0x1e0 RSP: ffff88041fb43ed8
<4>[ 1150.954973] CR2: 0000000000021748
<4>[ 1150.954982] ---[ end trace 90ca7f022a777768 ]---
<4>[ 1150.994774] hpet1: lost 2 rtc interrupts
<4>[ 1151.039928] hpet1: lost 2 rtc interrupts
<4>[ 1151.080491] hpet1: lost 1 rtc interrupts
<4>[ 1151.100920] hpet1: lost 1 rtc interrupts
<4>[ 1151.140089] hpet1: lost 1 rtc interrupts
<4>[ 1151.170349] hpet1: lost 1 rtc interrupts
<4>[ 1151.195889] hpet1: lost 1 rtc interrupts
<4>[ 1151.248469] hpet1: lost 1 rtc interrupts
<4>[ 1151.282795] hpet1: lost 1 rtc interrupts
<4>[ 1151.304437] hpet1: lost 1 rtc interrupts
<0>[ 1154.599889] Kernel panic - not syncing: Fatal exception in interrupt
<0>[ 1154.599912] Dumping ftrace buffer:
<0>[ 1154.599916]    (ftrace buffer empty)
<0>[ 1154.599919] Kernel Offset: disabled
Comment 1 Chris Wilson 2018-03-05 10:23:02 UTC
Further set-wedge vs reset race. :|
Comment 2 Marta Löfstedt 2018-03-06 07:33:48 UTC
I closed bug 104945 now we are hitting this issue instead:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3875/shard-kbl5/igt@gem_eio@in-flight-external.html
Comment 3 Marta Löfstedt 2018-03-06 11:50:10 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3877/shard-apl6/igt@gem_eio@in-flight-contexts.html

From pstore:
<5>[  136.759626] i915 0000:00:02.0: Resetting chip after gpu hang
<3>[  136.759926] i915_gem_find_active_request:2880 GEM_BUG_ON((__builtin_constant_p((DMA_FENCE_FLAG_SIGNALED_BIT)) ? constant_test_bit((DMA_FENCE_FLAG_SIGNALED_BIT), (&request->fence.flags)) : variable_test_bit((DMA_FENCE_FLAG_SIGNALED_BIT), (&request->fence.flags))))
<4>[  136.760269] ------------[ cut here ]------------
<2>[  136.760274] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:2880!
<4>[  136.760325] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
Comment 6 Chris Wilson 2018-03-16 10:18:38 UTC
commit ac697ae8013a7c7301174c9c3b02a92fe418b7ea
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 15 15:10:15 2018 +0000

    drm/i915: Stop engines when declaring the machine wedged
    
    If we fail to reset the GPU, we declare the machine wedged. However, the
    GPU may well still be running in the background with an in-flight
    request. So despite our efforts in cleaning up the request queue and
    faking the breadcrumb in the HWSP, the GPU may eventually write the
    in-flght seqno there breaking all of our assumptions and throwing the
    driver into a deep turmoil, wedging beyond wedged.
    
    To avoid this we ideally want to reset the GPU. Since that has already
    failed, make sure the rings have the stop bit set instead. This is part
    of the normal GPU reset sequence, but that is actually disabled by
    igt/gem_eio to force the wedged state. If we assume the worst, we must
    poke at the bit again before we give up.
    
    v2: Move the intel_gpu_reset() from set-wedged in the reset error path
    into i915_gem_set_wedged() itself. Even if the reset fails (e.g. if it is
    disabled by gem_eio), it still tries to make sure the engines are
    stopped. For i915_gem_set_wedged() callers from outside of i915_reset(),
    this should make sure the GPU is disabled while the driver is marked as
    being wedged.
    
    Testcase: igt/gem_eio
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180315151015.22741-1-chris@chris-wilson.co.uk
Comment 7 Marta Löfstedt 2018-03-19 10:09:58 UTC
It's looking green!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.