Created attachment 135040 [details] dmesg_live_contexts drv_selftests@live_contexts gets stuck forever, dmesg shows warn: WARN_ON(i915_gem_object_has_pinned_pages(obj)) [ 1170.532089] ------------[ cut here ]------------ [ 1170.532119] WARNING: CPU: 6 PID: 72 at drivers/gpu/drm/i915/i915_gem.c:4553 __i915_gem_free_objects+0x2ad/0x2c0 [i915] [ 1170.532119] Modules linked in: i915(+) drm_kms_helper drm ip6table_filter ip6_tables cmac bnep iptable_filter arc4 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec snd_hda_core snd_hwdep intel_rapl x86_pkg_temp_thermal snd_pcm intel_powerclamp coretemp snd_seq_midi snd_seq_midi_event kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iwlmvm pcbc snd_rawmidi mac80211 snd_seq snd_seq_device snd_timer aesni_intel binfmt_misc aes_x86_64 crypto_simd glue_helper cryptd intel_cstate nls_iso8859_1 intel_rapl_perf input_leds ir_rc6_decoder snd iwlwifi cfg80211 btusb btrtl btbcm soundcore mei_me mei shpchp intel_pch_thermal hci_uart btqca rc_rc6_mce btintel bluetooth ir_lirc_codec lirc_dev nuvoton_cir rc_core ecdh_generic intel_lpss_acpi intel_lpss acpi_als [ 1170.532141] acpi_pad mac_hid kfifo_buf industrialio parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid i2c_algo_bit prime_numbers syscopyarea sysfillrect sysimgblt e1000e fb_sys_fops nvme sdhci_pci ptp pps_core nvme_core sdhci wmi pinctrl_sunrisepoint i2c_hid video pinctrl_intel hid [last unloaded: drm] [ 1170.532152] CPU: 6 PID: 72 Comm: kworker/6:1 Tainted: G U 4.14.0-rc6-drm-intel-qa-ww43-commit-5c82a37+ #1 [ 1170.532153] Hardware name: /NUC6i7KYB, BIOS KYSKLi70.86A.0050.2017.0831.1924 08/31/2017 [ 1170.532178] Workqueue: events __i915_gem_free_work [i915] [ 1170.532178] task: ffff8fb999d74d80 task.stack: ffffa951833fc000 [ 1170.532200] RIP: 0010:__i915_gem_free_objects+0x2ad/0x2c0 [i915] [ 1170.532200] RSP: 0018:ffffa951833ffdf0 EFLAGS: 00010286 [ 1170.532201] RAX: 000000000000002e RBX: ffff8fb2d4ad4300 RCX: 0000000000000000 [ 1170.532202] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000246 [ 1170.532202] RBP: ffffa951833ffe30 R08: 000000000000002e R09: ffffffff8e352300 [ 1170.532203] R10: ffffa951831efe38 R11: 000000000000002e R12: ffff8fb2d4ad43f8 [ 1170.532203] R13: ffff8fb99850ad00 R14: ffff8fb2c3df8000 R15: ffff8fb2c3dfbf10 [ 1170.532204] FS: 0000000000000000(0000) GS:ffff8fb9be980000(0000) knlGS:0000000000000000 [ 1170.532204] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1170.532205] CR2: 0000561c30bb7e80 CR3: 000000064c609003 CR4: 00000000003606e0 [ 1170.532206] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 1170.532206] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 1170.532207] Call Trace: [ 1170.532228] __i915_gem_free_work+0x38/0x50 [i915] [ 1170.532231] process_one_work+0x156/0x410 [ 1170.532232] worker_thread+0x4b/0x410 [ 1170.532234] kthread+0x109/0x140 [ 1170.532235] ? process_one_work+0x410/0x410 [ 1170.532236] ? kthread_create_on_node+0x70/0x70 [ 1170.532238] ret_from_fork+0x25/0x30 [ 1170.532239] Code: 49 8b 04 24 48 85 c0 75 e9 e9 d6 fd ff ff e8 fb 23 9f cc e9 d5 fe ff ff 48 c7 c6 30 59 f3 c0 48 c7 c7 b9 4b f2 c0 e8 04 ad 49 cc <0f> ff c7 83 d8 01 00 00 00 00 00 00 e9 7e fe ff ff 66 90 0f 1f [ 1170.532254] ---[ end trace 5d85e68ee1f1aef7 ]---
kasan would be useful here; looks like a double unref (or missing ref).
Rising priority since it got our test cycles stuck for hours until we manually power platforms off.
Now would seem to be the perfect time to work on automated oops handling (or just turn on panic_on_oops)... I assure you this won't be the last.
commit f991c492aa55fb1c6834882c5d786d5bb3b25f07 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Nov 6 11:15:08 2017 +0000 drm/i915: Lock llist_del_first() vs llist_del_all() An oversight in commit 87701b4b5593 ("drm/i915: Only free the oldest stale object before a fresh allocation") was that not only do we have to serialise concurrent users of llist_del_first(), but we also have to lock llist_del_first() vs llist_del_all(). From llist.h, * This can be summarized as follows: * * | add | del_first | del_all * add | - | - | - * del_first | | L | L * del_all | | | - * * Where, a particular row's operation can happen concurrently with a column's * operation, with "-" being no lock needed, while "L" being lock is needed. This should hopefully explain: <4>[ 89.287106] general protection fault: 0000 [#1] PREEMPT SMP <4>[ 89.287126] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp i915 crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core r8169 mii mei_me mei snd_pcm prime_numbers i2c_hid pinctrl_geminilake pinctrl_intel <4>[ 89.287226] CPU: 2 PID: 23 Comm: ksoftirqd/2 Tainted: G U 4.14.0-rc8-CI-CI_DRM_3315+ #1 <4>[ 89.287247] Hardware name: Intel Corp. Geminilake/GLK RVP2 LP4SD (07), BIOS GELKRVPA.X64.0062.B30.1708222146 08/22/2017 <4>[ 89.287270] task: ffff88017ab34ec0 task.stack: ffffc90000128000 <4>[ 89.287290] RIP: 0010:llist_add_batch+0x4/0x20 <4>[ 89.287301] RSP: 0018:ffffc9000012bdb8 EFLAGS: 00010296 <4>[ 89.287314] RAX: ffffffff811017ad RBX: 6e468801a1560000 RCX: ef3e53fceecdeb81 <4>[ 89.287330] RDX: 6e468801a1566130 RSI: ffff880103d73d98 RDI: ffff880103d73d98 <4>[ 89.287346] RBP: ffffc9000012bdb8 R08: ffff88017ab35780 R09: 0000000000000000 <4>[ 89.287361] R10: ffffc9000012bd68 R11: 00000000abb18c3d R12: ffffffffa01369e0 <4>[ 89.287377] R13: ffff88017fd1b8f8 R14: ffff88017ab34ec0 R15: 000000000000000a <4>[ 89.287393] FS: 0000000000000000(0000) GS:ffff88017fd00000(0000) knlGS:0000000000000000 <4>[ 89.287411] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[ 89.287424] CR2: 00007ff0c0755018 CR3: 000000016df9b000 CR4: 00000000003406e0 <4>[ 89.287440] Call Trace: <4>[ 89.287511] __i915_gem_free_object_rcu+0x20/0x40 [i915] <4>[ 89.287527] rcu_process_callbacks+0x27a/0x730 <4>[ 89.287544] __do_softirq+0xc0/0x4ae <4>[ 89.287559] ? smpboot_thread_fn+0x2d/0x280 <4>[ 89.287571] run_ksoftirqd+0x1f/0x70 <4>[ 89.287582] smpboot_thread_fn+0x18a/0x280 <4>[ 89.287595] kthread+0x114/0x150 <4>[ 89.287605] ? sort_range+0x30/0x30 <4>[ 89.287615] ? kthread_create_on_node+0x40/0x40 <4>[ 89.287628] ret_from_fork+0x27/0x40 <4>[ 89.287641] Code: 0d 48 83 ea 01 4c 89 c1 48 83 fa ff 74 12 48 23 0c d7 74 ed 48 c1 e2 06 48 0f bd c9 48 8d 04 0a 5d c3 90 90 90 90 90 55 48 89 e5 <48> 8b 0a 48 89 0e 48 89 c8 f0 48 0f b1 3a 48 39 c1 75 ed 48 85 <1>[ 89.287774] RIP: llist_add_batch+0x4/0x20 RSP: ffffc9000012bdb8 <4>[ 89.287826] ---[ end trace e775d15174d8ae02 ]--- (Lockless lists are only easy (and lockless) when only using llist_add/llist_del_all!)
$ : sudo -E ./drv_selftest --r live_contexts --d IGT-Version: 1.20-g88d6550 (x86_64) (Linux: 4.14.0-drm-intel-qa-ww46-commit-06718a2+ x86_64) (drv_selftest:12645) igt-kmod-DEBUG: Test requirement passed: err == 0 || err == -ENOENT (drv_selftest:12645) igt-kmod-DEBUG: Test requirement passed: igt_kselftest_begin(&tst) == 0 (drv_selftest:12645) igt-kmod-DEBUG: Test requirement passed: !igt_list_empty(&tests) (drv_selftest:12645) igt-kmod-DEBUG: Test requirement passed: err == 0 || err == -ENOENT (drv_selftest:12645) igt-kmod-DEBUG: Test requirement passed: igt_kselftest_begin(&tst) == 0 (drv_selftest:12645) igt-core-DEBUG: Starting subtest: live_contexts Subtest live_contexts: SUCCESS (1.988s) (drv_selftest:12645) igt-kmod-DEBUG: Test requirement passed: !igt_list_empty(&tests) (drv_selftest:12645) igt-core-DEBUG: Exiting with status code 0 Verified.
Closing old verified.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.