Created attachment 135023 [details]
dmesg_mock_evict

We are hitting same problem as described in bug 102973 with latest kernel:

$uname -a
Linux SKL-5-NUC6i7KYB 4.14.0-rc6-drm-intel-qa-ww43-commit-5c82a37+ #1 SMP Tue Oct 24 07:34:21 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux

With drv_selftest@mock_evict and drv_selftest@live_uncore.

[ 345.788900] [IGT] drv_selftest: starting subtest mock_evict
[ 345.833104] Setting dangerous option enable_guc_submission - tainting kernel
[ 345.833106] Setting dangerous option enable_guc_loading - tainting kernel
[ 345.833107] Setting dangerous option alpha_support - tainting kernel
[ 345.833110] Setting dangerous option mock_selftests - tainting kernel
[ 345.833334] i915: Performing mock selftests with st_random_seed=0xf07c3152 st_timeout=1000
[ 372.248023] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [drv_selftest:1470]
[ 372.248027] Modules linked in: i915(+) drm_kms_helper drm ip6table_filter ip6_tables cmac iptable_filter bnep arc4 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec snd_hda_core snd_hwdep snd_pcm intel_rapl snd_seq_midi snd_seq_midi_event x86_pkg_temp_thermal intel_powerclamp snd_rawmidi coretemp kvm_intel binfmt_misc iwlmvm kvm mac80211 irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_seq nls_iso8859_1 snd_seq_device ir_rc6_decoder snd_timer iwlwifi aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf cfg80211 btusb btrtl hci_uart btbcm snd btqca btintel mei_me input_leds soundcore mei shpchp intel_pch_thermal rc_rc6_mce bluetooth ir_lirc_codec lirc_dev nuvoton_cir rc_core ecdh_generic intel_lpss_acpi intel_lpss acpi_als
[ 372.248080] acpi_pad kfifo_buf mac_hid industrialio parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic usbhid i2c_algo_bit prime_numbers syscopyarea e1000e sysfillrect sysimgblt fb_sys_fops nvme ptp nvme_core pps_core sdhci_pci sdhci wmi i2c_hid video pinctrl_sunrisepoint hid pinctrl_intel [last unloaded: i915]
[ 372.248113] CPU: 2 PID: 1470 Comm: drv_selftest Tainted: G U 4.14.0-rc6-drm-intel-qa-ww43-commit-5c82a37+ #1
[ 372.248113] Hardware name: /NUC6i7KYB, BIOS KYSKLi70.86A.0050.2017.0831.1924 08/31/2017
[ 372.248114] task: ffff980798798f80 task.stack: ffffb6ae44228000
[ 372.248144] RIP: 0010:i915_gem_evict_something+0x149/0x470 [i915]
[ 372.248145] RSP: 0018:ffffb6ae4422bac8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10
[ 372.248146] RAX: ffff980768e0f630 RBX: 00000000000003c1 RCX: ffffb6ae4422baf0
[ 372.248146] RDX: 0000000000000000 RSI: 0000000000000003 RDI: 0000000000000000
[ 372.248146] RBP: ffffb6ae4422bb98 R08: 0000000000000000 R09: ffff98077304dfe0
[ 372.248147] R10: ffffffffffffffff R11: 0000000000fff000 R12: ffffb6ae4422baf0
[ 372.248147] R13: 0000000000000000 R14: ffff980768e0f440 R15: ffff98077304c520
[ 372.248148] FS: 00007fe951dfb880(0000) GS:ffff9807be880000(0000) knlGS:0000000000000000
[ 372.248148] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 372.248149] CR2: 00007fc56c9da000 CR3: 0000000858547004 CR4: 00000000003606e0
[ 372.248150] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 372.248150] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 372.248150] Call Trace:
[ 372.248176] ? checked_vma_instance+0x120/0x140 [i915]
[ 372.248197] igt_evict_something+0x56/0xd0 [i915]
[ 372.248222] __i915_subtests+0x47/0xd0 [i915]
[ 372.248242] i915_gem_evict_mock_selftests+0x42/0x70 [i915]
[ 372.248266] __run_selftests+0x155/0x1e0 [i915]
[ 372.248267] ? 0xffffffffc054a000
[ 372.248288] i915_mock_selftests+0x30/0x60 [i915]
[ 372.248307] i915_init+0xa/0x6e [i915]
[ 372.248309] do_one_initcall+0x53/0x190
[ 372.248311] ? __vunmap+0x81/0xb0
[ 372.248312] ? kmem_cache_alloc_trace+0xe7/0x1d0
[ 372.248314] do_init_module+0x5f/0x209
[ 372.248314] load_module+0x2735/0x2c80
[ 372.248317] ? ima_post_read_file+0x7d/0xa0
[ 372.248318] SYSC_finit_module+0xe5/0x120
[ 372.248319] ? SYSC_finit_module+0xe5/0x120
[ 372.248320] SyS_finit_module+0xe/0x10
[ 372.248322] entry_SYSCALL_64_fastpath+0x1e/0xa9
[ 372.248322] RIP: 0033:0x7fe95030fd29
[ 372.248323] RSP: 002b:00007ffcb04be638 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 372.248324] RAX: ffffffffffffffda RBX: 0000000000000012 RCX: 00007fe95030fd29
[ 372.248324] RDX: 0000000000000000 RSI: 0000000001a00670 RDI: 0000000000000012
[ 372.248325] RBP: 00007ffcb04bd640 R08: 0000000000000000 R09: 000000000000003c
[ 372.248325] R10: 0000000000000012 R11: 0000000000000246 R12: 0000000001a01a20
[ 372.248326] R13: 00007ffcb04bd620 R14: 0000000000000005 R15: 0000000000000000
[ 372.248326] Code: 8d b0 10 fe ff ff 75 1c e9 47 01 00 00 49 8b 86 f0 01 00 00 49 39 c7 4c 8d b0 10 fe ff ff 0f 84 30 01 00 00 41 8b 9e f8 00 00 00 <83> e3 0f 75 dd 45 85 ed 74 0c 49 8b 86 f8 00 00 00 f6 c4 08 75
(In reply to Elizabeth from comment #0)
> Created attachment 135023 [details]
> dmesg_mock_evict
>
> We are hitting same problem as described in bug 102973 with latest kernel:

No, that is not the same bug.

> $uname -a
> Linux SKL-5-NUC6i7KYB 4.14.0-rc6-drm-intel-qa-ww43-commit-5c82a37+ #1 SMP
> Tue Oct 24 07:34:21 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux
>
> With drv_selftest@mock_evict and drv_selftest@live_uncore.

mock_evict and live_uncore are two very different tests.
Created attachment 135024 [details]
dmesg_live_contexts

drv_selftests@live_contexts shows the same behavior: once the test starts running, it gets stuck forever. The dmesg error is different for this test, though; it emits a warning:

WARNING: CPU: 6 PID: 72 at drivers/gpu/drm/i915/i915_gem.c:4553 __i915_gem_free_objects+0x2ad/0x2c0 [i915]
(In reply to Chris Wilson from comment #1)
> (In reply to Elizabeth from comment #0)
> ...
> mock_evict and live_uncore are two very different tests.

live_uncore re-tested. The platform "dies" (no display output and no ssh access until a power reset) after running this command:

$sudo -E ./drv_selftests --r live_uncore

Opening a new bug for this test.
Reference to: https://patchwork.freedesktop.org/series/32576/
commit 20ccd4d3f689ac14dce8632d76769be0ac952060
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Oct 24 23:08:55 2017 +0100

    drm/i915: Use same test for eviction and submitting kernel context

    During evict, we wish to idle the GPU if we see that the GGTT is full.
    However, our test for idle in i915_gem_evict_something() and in
    i915_gem_switch_to_kernel_context() do not match leading to
    disappointment - we never believe that we are idle and keep trying to
    flush the GGTT ad infinitum.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103438
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171024220855.30155-2-chris@chris-wilson.co.uk
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>

Fixes the first issue of mock_evict.

The trace for live_contexts needs kasan and a confirmation that you are carrying the core-for-CI fixups? Also since it is a separate bug, please refile it.
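For readers following along, the mismatch described in the commit message above can be illustrated with a toy model. This is only a sketch in Python, not the i915 code: the `Gpu` fields, both idle predicates, and the `max_passes` cap are hypothetical stand-ins, and the thread does not say exactly how the two real tests differed. The sketch only shows why a retry loop cannot terminate when the step that is supposed to idle the device uses a looser idle test than the loop itself.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    """Toy device state; both fields are hypothetical, not i915 structures."""
    active_requests: int = 0
    kernel_context_loaded: bool = False


def idle_for_evict(gpu):
    # Stricter notion of idle (assumed): nothing in flight AND kernel context loaded.
    return gpu.active_requests == 0 and gpu.kernel_context_loaded


def idle_for_switch(gpu):
    # Looser notion of idle (assumed): nothing in flight.
    return gpu.active_requests == 0


def switch_to_kernel_context(gpu, is_idle):
    # Does nothing when its own test says the device is already idle.
    if is_idle(gpu):
        return
    gpu.active_requests = 0
    gpu.kernel_context_loaded = True


def evict_something(gpu, switch_idle_test, max_passes=5):
    """Return the pass on which the device looked idle, or None if it never did.

    The real loop has no pass limit; the cap exists only so the toy terminates.
    """
    for passes in range(1, max_passes + 1):
        if idle_for_evict(gpu):
            return passes
        switch_to_kernel_context(gpu, switch_idle_test)
    return None


# Mismatched tests: the switch step thinks there is nothing to do, so the
# eviction loop never sees an idle device.
print(evict_something(Gpu(), idle_for_switch))

# Same test on both sides: the switch step does the work and the loop ends.
print(evict_something(Gpu(), idle_for_evict))
```

With the mismatched predicates the function gives up and returns `None`, which corresponds to the soft lockup in the uncapped loop; with a shared predicate it converges on the second pass.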
(03:04 AM) [gfx@SKL-5-NUC6i7KYB] [~]$ : uname -a
Linux SKL-5-NUC6i7KYB 4.14.0-rc6-drm-tip-ww43-commit-619a850+ #1 SMP Wed Oct 25 09:14:59 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux
(03:05 AM) [gfx@SKL-5-NUC6i7KYB] [~]$ : sudo -E ./intel-graphics/intel-gpu-tools/tests/drv_selftest --r mock_evict --d
IGT-Version: 1.20-ge7742ee (x86_64) (Linux: 4.14.0-rc6-drm-tip-ww43-commit-619a850+ x86_64)
(drv_selftest:1985) igt-kmod-DEBUG: Test requirement passed: err == 0 || err == -ENOENT
(drv_selftest:1985) igt-kmod-DEBUG: Test requirement passed: igt_kselftest_begin(&tst) == 0
(drv_selftest:1985) igt-core-DEBUG: Starting subtest: mock_evict
Subtest mock_evict: SUCCESS (0.274s)
(drv_selftest:1985) igt-kmod-DEBUG: Test requirement passed: !igt_list_empty(&tests)
(drv_selftest:1985) igt-kmod-DEBUG: Test requirement passed: err == 0 || err == -ENOENT
(drv_selftest:1985) igt-kmod-DEBUG: Test requirement passed: igt_kselftest_begin(&tst) == 0
(drv_selftest:1985) igt-kmod-DEBUG: Test requirement passed: !igt_list_empty(&tests)
(drv_selftest:1985) igt-core-DEBUG: Exiting with status code 0

Verified.
Closing old verified.