Bug 110520 - [CI][SHARDS] igt@i915_selftest@live_evict - incomplete - RIP: \d+:__queue_work
Summary: [CI][SHARDS] igt@i915_selftest@live_evict - incomplete - RIP: \d+:__queue_work
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: high normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-04-26 06:22 UTC by Martin Peres
Modified: 2019-07-08 08:42 UTC
CC List: 1 user

See Also:
i915 platform: SKL
i915 features: GEM/Other


Attachments

Description Martin Peres 2019-04-26 06:22:45 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5990/shard-skl2/igt@i915_selftest@live_evict.html

<4> [356.861098] WARNING: CPU: 3 PID: 0 at kernel/workqueue.c:1416 __queue_work+0x443/0x580
<4> [356.861137] Modules linked in: i915(+) snd_hda_codec_hdmi x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec btusb snd_hwdep btrtl ghash_clmulni_intel btbcm btintel snd_hda_core cdc_ether snd_pcm usbnet r8152 bluetooth mii ecdh_generic i2c_hid pinctrl_sunrisepoint pinctrl_intel prime_numbers [last unloaded: i915]
<4> [356.861220] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G     U            5.1.0-rc6-CI-CI_DRM_5990+ #1
<4> [356.861230] Hardware name: Google Caroline/Caroline, BIOS MrChromebox 08/27/2018
<4> [356.861244] RIP: 0010:__queue_work+0x443/0x580
<4> [356.861257] Code: fd ff ff e8 71 3d f3 ff e9 0c fd ff ff 65 8b 05 d3 71 f4 7e a9 00 01 1f 00 75 0f 65 48 8b 3c 25 c0 4e 01 00 f6 47 24 20 75 6f <0f> 0b 48 83 c4 08 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 0b e9 fe fc
<4> [356.861268] RSP: 0018:ffff88817ab83e98 EFLAGS: 00010006
<4> [356.861281] RAX: 0000000000000101 RBX: ffff88816baf8f40 RCX: 0000000000000000
<4> [356.861291] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffff
<4> [356.861301] RBP: 0000000000000202 R08: 0000000000000000 R09: 0000000000000001
<4> [356.861311] R10: ffff88817ab83e98 R11: 0000000000000000 R12: ffff888166e53ba8
<4> [356.861320] R13: 0000000000000010 R14: 0000000000000010 R15: ffffffff8113bf97
<4> [356.861332] FS:  0000000000000000(0000) GS:ffff88817ab80000(0000) knlGS:0000000000000000
<4> [356.861343] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [356.861353] CR2: 0000557e3b1df508 CR3: 0000000005214005 CR4: 00000000003606e0
<4> [356.861362] Call Trace:
<4> [356.861374]  <IRQ>
<4> [356.861389]  ? queue_work_on+0x31/0x70
<4> [356.861418]  ? rcu_core+0x3c7/0x6b0
<4> [356.861430]  queue_work_on+0x60/0x70
<4> [356.861766]  ? i915_gem_free_object+0x110/0x110 [i915]
<4> [356.861788]  rcu_core+0x36c/0x6b0
<4> [356.861826]  __do_softirq+0xd8/0x4b9
<4> [356.861865]  irq_exit+0xa9/0xc0
<4> [356.861879]  smp_apic_timer_interrupt+0x9c/0x250
<4> [356.861901]  apic_timer_interrupt+0xf/0x20
<4> [356.861914]  </IRQ>
<4> [356.861934] RIP: 0010:cpuidle_enter_state+0xae/0x450
<4> [356.861950] Code: 44 00 00 31 ff e8 32 11 93 ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 78 03 00 00 31 ff e8 6b e8 99 ff e8 f6 cb 9d ff fb 45 85 ed <0f> 88 c9 02 00 00 4c 2b 24 24 48 ba cf f7 53 e3 a5 9b c4 20 49 63
<4> [356.861967] RSP: 0018:ffffc9000008fea0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
<4> [356.861990] RAX: ffff888179a1a840 RBX: ffffffff8229ea20 RCX: 0000000000000003
<4> [356.862008] RDX: 0000000000000046 RSI: 0000000000000006 RDI: ffffffff821203f1
<4> [356.862023] RBP: ffffe8ffffd82060 R08: 0000000000000002 R09: 0000000000000000
<4> [356.862034] R10: 0000000000000000 R11: 0000000000000000 R12: 000000531691caaa
<4> [356.862043] R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000004
<4> [356.862099]  ? rcu_preempt_deferred_qs+0x19/0x80
<4> [356.862127]  do_idle+0x1f3/0x260
<4> [356.862156]  cpu_startup_entry+0x14/0x20
<4> [356.862177]  start_secondary+0x173/0x1c0
<4> [356.862201]  secondary_startup_64+0xa4/0xb0
<4> [356.862250] irq event stamp: 871279
<4> [356.862265] hardirqs last  enabled at (871278): [<ffffffff810cde01>] queue_work_on+0x31/0x70
<4> [356.862277] hardirqs last disabled at (871279): [<ffffffff810cdde9>] queue_work_on+0x19/0x70
<4> [356.862293] softirqs last  enabled at (871260): [<ffffffff810b5468>] irq_enter+0x58/0x60
<4> [356.862309] softirqs last disabled at (871261): [<ffffffff810b5519>] irq_exit+0xa9/0xc0
<4> [356.862325] WARNING: CPU: 3 PID: 0 at kernel/workqueue.c:1416 __queue_work+0x443/0x580
<4> [356.862335] ---[ end trace a9b766dc6d91a018 ]---
<4> [356.927102] ------------[ cut here ]------------
<4> [356.927119] Memory manager not clean during takedown.
<4> [356.927185] WARNING: CPU: 3 PID: 1381 at drivers/gpu/drm/drm_mm.c:950 drm_mm_takedown+0x1f/0x30
<4> [356.927196] Modules linked in: i915(+) snd_hda_codec_hdmi x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec btusb snd_hwdep btrtl ghash_clmulni_intel btbcm btintel snd_hda_core cdc_ether snd_pcm usbnet r8152 bluetooth mii ecdh_generic i2c_hid pinctrl_sunrisepoint pinctrl_intel prime_numbers [last unloaded: i915]
<4> [356.927323] CPU: 3 PID: 1381 Comm: i915_selftest Tainted: G     U  W         5.1.0-rc6-CI-CI_DRM_5990+ #1
<4> [356.927339] Hardware name: Google Caroline/Caroline, BIOS MrChromebox 08/27/2018
<4> [356.927359] RIP: 0010:drm_mm_takedown+0x1f/0x30
<4> [356.927372] Code: a6 ff 66 0f 1f 84 00 00 00 00 00 48 8b 47 38 48 8d 57 38 48 39 c2 75 02 f3 c3 53 48 89 fb 48 c7 c7 80 4b 0f 82 e8 81 eb a6 ff <0f> 0b 48 89 df 5b e9 b6 fe ff ff 66 0f 1f 44 00 00 53 48 83 bf 90
<4> [356.927383] RSP: 0018:ffffc90000223b18 EFLAGS: 00010286
<4> [356.927396] RAX: 0000000000000000 RBX: ffff88816baf8d68 RCX: 0000000000000000
<4> [356.927406] RDX: 0000000000000007 RSI: ffff8881784d30b8 RDI: 00000000ffffffff
<4> [356.927416] RBP: ffff8881797d0008 R08: 00000000eab59cdd R09: 0000000000000000
<4> [356.927426] R10: 0000000000000000 R11: 0000000000000000 R12: ffffffffa0695160
<4> [356.927436] R13: ffffffffa0770250 R14: ffffffffa07701e0 R15: ffffc90000223e98
<4> [356.927448] FS:  00007f3608adf980(0000) GS:ffff88817ab80000(0000) knlGS:0000000000000000
<4> [356.927459] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [356.927471] CR2: 0000557e3b1df508 CR3: 000000017a17c004 CR4: 00000000003606e0
<4> [356.927485] Call Trace:
<4> [356.927881]  i915_driver_unload+0xe9/0x120 [i915]
<4> [356.928249]  i915_pci_remove+0x19/0x30 [i915]
<4> [356.928507]  i915_pci_probe+0x60/0xa0 [i915]
<4> [356.928543]  pci_device_probe+0xa1/0x120
<4> [356.928586]  really_probe+0xf3/0x3e0
<4> [356.928625]  driver_probe_device+0x10a/0x120
<4> [356.928662]  device_driver_attach+0x4b/0x50
<4> [356.928698]  __driver_attach+0x97/0x130
<4> [356.928732]  ? device_driver_attach+0x50/0x50
<4> [356.928756]  bus_for_each_dev+0x74/0xc0
<4> [356.928797]  bus_add_driver+0x13f/0x210
<4> [356.928823]  ? 0xffffffffa07f2000
<4> [356.928846]  driver_register+0x56/0xe0
<4> [356.928860]  ? 0xffffffffa07f2000
<4> [356.928874]  do_one_initcall+0x58/0x2e0
<4> [356.928891]  ? do_init_module+0x1d/0x1ea
<4> [356.928912]  ? rcu_read_lock_sched_held+0x6f/0x80
<4> [356.928926]  ? kmem_cache_alloc_trace+0x261/0x290
<4> [356.928965]  do_init_module+0x56/0x1ea
<4> [356.928998]  load_module+0x2701/0x29e0
<4> [356.929100]  ? __se_sys_finit_module+0xd3/0xf0
<4> [356.929112]  __se_sys_finit_module+0xd3/0xf0
<4> [356.929166]  do_syscall_64+0x55/0x190
<4> [356.929189]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [356.929202] RIP: 0033:0x7f3608398839
<4> [356.929215] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48
<4> [356.929226] RSP: 002b:00007ffeb32b5b58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4> [356.929240] RAX: ffffffffffffffda RBX: 00005606544f30e0 RCX: 00007f3608398839
<4> [356.929250] RDX: 0000000000000000 RSI: 00005606544f4140 RDI: 0000000000000006
<4> [356.929260] RBP: 00005606544f4140 R08: 0000000000000004 R09: 0000000000000000
<4> [356.929269] R10: 00007ffeb32b5cd0 R11: 0000000000000246 R12: 0000000000000000
<4> [356.929279] R13: 00005606544f0bf0 R14: 0000000000000020 R15: 0000000000000044
<4> [356.929328] irq event stamp: 973826
<4> [356.929349] hardirqs last  enabled at (973825): [<ffffffff81124e37>] console_unlock+0x3f7/0x5a0
<4> [356.929363] hardirqs last disabled at (973826): [<ffffffff810019b0>] trace_hardirqs_off_thunk+0x1a/0x1c
<4> [356.929376] softirqs last  enabled at (973774): [<ffffffff81c0033a>] __do_softirq+0x33a/0x4b9
<4> [356.929393] softirqs last disabled at (973767): [<ffffffff810b5519>] irq_exit+0xa9/0xc0
<4> [356.929411] WARNING: CPU: 3 PID: 1381 at drivers/gpu/drm/drm_mm.c:950 drm_mm_takedown+0x1f/0x30
<4> [356.929421] ---[ end trace a9b766dc6d91a019 ]---
Comment 1 CI Bug Log 2019-04-26 06:23:20 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SKL: igt@i915_selftest@live_evict - incomplete - RIP: \d+:__queue_work
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5990/shard-skl2/igt@i915_selftest@live_evict.html
Comment 2 Chris Wilson 2019-04-26 06:32:27 UTC
That's a secondary bug from the previous failure. We left state dangling, which then exploded when it ran at the wrong time.
Comment 3 Chris Wilson 2019-05-01 21:37:52 UTC
commit dc76e5764a46ffb2e7f502a86b3288b5edcce191 (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 1 14:57:51 2019 +0100

    drm/i915: Complete both freed-object passes before draining the workqueue
    
    The workqueue code complains viciously if we try to queue more work onto
    the queue while attempting to drain it. As we asynchronously free
    objects and defer their enqueuing with RCU, it is quite tricky to
    quiesce the system before attempting to drain the workqueue. Yet drain
    we must to ensure that the worker is idle before unloading the module.
    
    Give the freed-object drain 3 whole passes with multiple rcu_barrier()
    calls, to cover deferred freeing several levels deep, with each level
    protected by RCU and needing a grace period before its parent can be
    freed, ultimately resulting in a GEM object being freed after another
    RCU grace period.
    
    A consequence is that it will make module unload even slower.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=110550
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Matthew Auld <matthew.auld@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190501135753.8711-1-chris@chris-wilson.co.uk
Comment 4 Martin Peres 2019-07-08 08:42:29 UTC
(In reply to Chris Wilson from comment #3)
> commit dc76e5764a46ffb2e7f502a86b3288b5edcce191 (HEAD ->
> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
> [...]
>     drm/i915: Complete both freed-object passes before draining the workqueue
> [...]

Thanks, it did the trick!
Comment 5 CI Bug Log 2019-07-08 08:42:37 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

