107925 – [CI][SHARDS] igt@gem_eio@in-flight-suspend - dmesg-warn - GEM_BUG_ON(!execlists_is_active(execlists, 0))

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	Other All

Importance:	high normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Reported:	2018-09-13 18:09 UTC by Martin Peres
Modified:	2018-11-13 16:02 UTC
CC List:	1 user

i915 platform:	GLK
i915 features:	GEM/execlists

Description Martin Peres 2018-09-13 18:09:48 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4817/shard-glk2/igt@gem_eio@in-flight-suspend.html

<3> [505.276217] process_csb:988 GEM_BUG_ON(!execlists_is_active(execlists, 0))
<4> [505.276399] ------------[ cut here ]------------
<2> [505.276402] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:988!
<4> [505.276438] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [505.276449] CPU: 1 PID: 2943 Comm: gem_eio Tainted: G     U            4.19.0-rc3-CI-CI_DRM_4817+ #1
<4> [505.276462] Hardware name: Intel Corporation NUC7CJYH/NUC7JYB, BIOS JYGLKCPX.86A.0027.2018.0125.1347 01/25/2018
<4> [505.276527] RIP: 0010:process_csb+0x4a8/0x780 [i915]
<4> [505.276536] Code: 57 70 f9 e0 48 8b 35 7f c5 19 00 49 c7 c0 90 6e 26 a0 b9 dc 03 00 00 48 c7 c2 d0 f4 22 a0 48 c7 c7 d3 2d 16 a0 e8 f8 fe ff e0 <0f> 0b 48 8b 75 d0 4c 8d a6 88 16 00 00 4c 89 e7 e8 f3 c7 7d e1 48
<4> [505.276562] RSP: 0018:ffffc90002f7ba48 EFLAGS: 00010086
<4> [505.276572] RAX: 000000000000000d RBX: ffff880268b12158 RCX: 0000000000000000
<4> [505.276582] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff880276d98ff8
<4> [505.276593] RBP: ffffc90002f7bab0 R08: 0000000000154618 R09: ffff88027666a000
<4> [505.276604] R10: ffffc90002f7ba38 R11: ffff880276d98ff8 R12: ffff88026504704c
<4> [505.276615] R13: 0000000000000001 R14: ffff880265047048 R15: ffff880265047040
<4> [505.276626] FS:  00007fd29b4b7980(0000) GS:ffff880277e80000(0000) knlGS:0000000000000000
<4> [505.276638] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [505.276647] CR2: 00005602f1dbe490 CR3: 0000000249f9e000 CR4: 0000000000340ee0
<4> [505.276657] Call Trace:
<4> [505.276716]  execlists_reset_prepare+0x54/0x150 [i915]
<4> [505.276772]  i915_gem_reset_prepare_engine+0x20/0x40 [i915]
<4> [505.276826]  i915_gem_reset_prepare+0x2c/0x70 [i915]
<4> [505.276876]  i915_reset+0x117/0x280 [i915]
<4> [505.276925]  i915_reset_device+0x1fb/0x290 [i915]
<4> [505.276976]  ? __intel_get_crtc_scanline+0x1c0/0x1c0 [i915]
<4> [505.276991]  ? work_on_cpu_safe+0x50/0x50
<4> [505.277041]  i915_handle_error+0x219/0x350 [i915]
<4> [505.277097]  ? reset_all_global_seqno.part.5+0x3c/0x260 [i915]
<4> [505.277109]  ? mark_held_locks+0x50/0x80
<4> [505.277159]  ? i915_drop_caches_set+0x16e/0x260 [i915]
<4> [505.277171]  ? _raw_spin_unlock_irqrestore+0x39/0x60
<4> [505.277182]  ? __mutex_unlock_slowpath+0x46/0x2b0
<4> [505.277234]  i915_drop_caches_set+0x1c6/0x260 [i915]
<4> [505.277246]  simple_attr_write+0xb0/0xd0
<4> [505.277256]  full_proxy_write+0x51/0x80
<4> [505.277267]  __vfs_write+0x31/0x180
<4> [505.277275]  ? rcu_lockdep_current_cpu_online+0x8f/0xd0
<4> [505.277286]  ? rcu_read_lock_sched_held+0x6f/0x80
<4> [505.277295]  ? rcu_sync_lockdep_assert+0x29/0x50
<4> [505.277305]  ? __sb_start_write+0x152/0x1f0
<4> [505.277313]  ? __sb_start_write+0x168/0x1f0
<4> [505.277322]  vfs_write+0xbd/0x1b0
<4> [505.277331]  ksys_write+0x50/0xc0
<4> [505.277340]  do_syscall_64+0x55/0x190
<4> [505.277349]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [505.277358] RIP: 0033:0x7fd29aa312b7
<4> [505.277365] Code: 44 00 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 5b fd ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 35 44 89 c7 48 89 44 24 08 e8 94 fd ff ff 48
<4> [505.277391] RSP: 002b:00007ffc240d9280 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
<4> [505.277404] RAX: ffffffffffffffda RBX: 0000000000000009 RCX: 00007fd29aa312b7
<4> [505.277415] RDX: 0000000000000005 RSI: 00007ffc240d9330 RDI: 0000000000000009
<4> [505.277426] RBP: 00007ffc240d9330 R08: 0000000000000000 R09: 0000000000000000
<4> [505.277436] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000005
<4> [505.277447] R13: 0000000000000003 R14: 00007fd29aa1f628 R15: 00007fd29aa1bd80

Comment 1 Chris Wilson 2018-09-13 18:15:44 UTC

Should hopefully be fixed by https://patchwork.freedesktop.org/patch/249316/

Comment 2 Chris Wilson 2018-09-13 19:46:04 UTC

Might want to summon Petri here. An interesting case: the result is only a WARN because the failure happened after the test itself had finished, even though we BUGed out. That should be a much more severe error (previously such a failure was reported as an incomplete, since the machine rebooted).

Comment 3 Chris Wilson 2018-09-14 14:24:54 UTC

commit 8db601f09127eb974e6fcf7fb30c70344d5727f6 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 14 09:00:17 2018 +0100

    drm/i915/execlists: Reset CSB pointers on canceling requests (wedging)
    
    The prior assumption was that we did not need to reset the CSB on
    wedging when cancelling the outstanding requests as it would be cleaned
    up in the subsequent reset prior to restarting the GPU. However, what
    was not accounted for was that in preparing for the reset, we would try
    to process the outstanding CSB entries. If the GPU happened to complete
    a CS event just as we were performing the cancellation of requests, that
    event would be kept in the CSB until the reset -- but our bookkeeping
    was cleared, causing confusion when trying to complete the CS event.
    
    v2: Use a sanitize on unwedge to avoid interfering with eio suspend
    (where we intentionally disable GPU reset).
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107925
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180914080017.30308-3-chris@chris-wilson.co.uk
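
For readers who do not know the execlists code, the race described in the commit message can be illustrated with a toy model. This is only a sketch in Python, not the i915 implementation: the class, the method names and the `reset_csb` flag are hypothetical, the real CSB is a hardware-written ring with read/write pointers rather than a queue, and the v2 detail of where exactly the reset is performed is collapsed into a single flag.

```python
from collections import deque


class ToyExeclists:
    """Toy model: a CSB-like event queue plus driver-side 'active' bookkeeping."""

    def __init__(self):
        self.csb = deque()    # unread context-switch events posted by the "GPU"
        self.active = False   # driver's belief that a request is in flight

    def submit(self):
        self.active = True

    def gpu_complete(self):
        # The "GPU" may post a completion event at any time, even mid-cancel.
        self.csb.append("complete")

    def cancel_requests(self, reset_csb):
        # Wedging: the driver clears its bookkeeping for outstanding requests.
        self.active = False
        if reset_csb:
            # Assumed stand-in for the fix: also discard unread CSB events.
            self.csb.clear()

    def process_csb(self):
        # Stand-in for the GEM_BUG_ON(!execlists_is_active(...)) check.
        while self.csb:
            self.csb.popleft()
            if not self.active:
                raise AssertionError("CS event with no active request")
            self.active = False


def run(reset_csb):
    e = ToyExeclists()
    e.submit()
    e.gpu_complete()              # event lands just as requests are cancelled
    e.cancel_requests(reset_csb)
    try:
        e.process_csb()           # reset preparation drains outstanding events
    except AssertionError:
        return "BUG"
    return "ok"
```

In this model `run(False)` ends in the assertion, because the stale event outlives the bookkeeping it refers to, while `run(True)` completes cleanly because the event is dropped together with the bookkeeping.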

Comment 4 Lakshmi 2018-10-15 08:03:44 UTC

This issue occurred only once with CI_DRM_4817_full (1 month / 400 runs ago).

Comment 5 Martin Peres 2018-11-13 16:02:40 UTC

(In reply to Lakshmi from comment #4)
> This issue occurred only once with CI_DRM_4817_full (1 month / 400 runs ago).

In these cases, feel free to close and archive the issue in CI Bug log ;)
