During a system stress test, with memory under pressure, we hit:

[2018-11-28 23:17:54] [12278.310417] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:4702!
[2018-11-28 23:17:54] [12278.310802] invalid opcode: 0000 [#1] PREEMPT SMP
[2018-11-28 23:17:54] [12278.311012] CPU: 0 PID: 61 Comm: kswapd0 Tainted: G U WC 4.19.0-26.iot-lts2018-sos #1
[2018-11-28 23:17:54] [12278.311393] RIP: 0010:i915_gem_wait_for_idle.part.78.cold.114+0x45/0x47
[2018-11-28 23:17:54] [12278.311675] Code: 7b 8b ae ff 48 8b 35 e6 92 3c 01 49 c7 c0 af 48 55 a9 b9 5e 12 00 00 48 c7 c2 50 7a 0b a9 48 c7 c7 f4 e6 60 a8 e8 37 38 b6 ff <0f> 0b 48 c7 c1 a8 59 55 a9 ba b8 12 00 00 48 c7 c6 20 7a 0b a9 48
[2018-11-28 23:17:55] [12278.312447] RSP: 0018:ffff8e31acd8bbb8 EFLAGS: 00010246
[2018-11-28 23:17:55] [12278.312673] RAX: 000000000000000e RBX: 000000000000000a RCX: 0000000000000000
[2018-11-28 23:17:55] [12278.312971] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8e31ae841400
[2018-11-28 23:17:55] [12278.313268] RBP: ffff8e31acea8340 R08: 0000000001416578 R09: ffff8e31aea15000
[2018-11-28 23:17:55] [12278.313566] R10: ffff8e31ae807100 R11: ffff8e31ae841400 R12: ffff8e31acea0000
[2018-11-28 23:17:55] [12278.313863] R13: 00000b2ab1d38ed0 R14: 0000000000000000 R15: ffff8e31acd8bd70
[2018-11-28 23:17:55] [12278.314162] FS: 0000000000000000(0000) GS:ffff8e31afa00000(0000) knlGS:0000000000000000
[2018-11-28 23:17:55] [12278.314499] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2018-11-28 23:17:55] [12278.314741] CR2: 00007ff94948f000 CR3: 0000000226813000 CR4: 00000000003406f0
[2018-11-28 23:17:55] [12278.315039] Call Trace:
[2018-11-28 23:17:55] [12278.315162]  i915_gem_shrink+0x3b7/0x4b0
[2018-11-28 23:17:55] [12278.315340]  i915_gem_shrinker_scan+0x104/0x130
[2018-11-28 23:17:55] [12278.315537]  do_shrink_slab+0x12c/0x2c0
[2018-11-28 23:17:55] [12278.315706]  shrink_slab+0x225/0x2c0
[2018-11-28 23:17:55] [12278.315864]  shrink_node+0xe4/0x430
[2018-11-28 23:17:55] [12278.316018]  kswapd+0x3ce/0x730
[2018-11-28 23:17:55] [12278.316161]  ? mem_cgroup_shrink_node+0x1a0/0x1a0
[2018-11-28 23:17:55] [12278.316365]  kthread+0x11e/0x140
[2018-11-28 23:17:55] [12278.316508]  ? kthread_create_worker_on_cpu+0x70/0x70
[2018-11-28 23:17:55] [12278.316727]  ret_from_fork+0x3a/0x50
[2018-11-28 23:17:55] [12278.316884] Modules linked in: igb_avb(C) xhci_pci xhci_hcd dca ici_isys_mod ipu4_acpi intel_ipu4_isys_csslib intel_ipu4_psys intel_ipu4_psys_csslib intel_ipu4_mmu intel_ipu4 iova crlmodule_lite

More specifically, around line 4702 of ./drivers/gpu/drm/i915/i915_gem.c:

	err = wait_for_engines(i915);
	if (err)
		return err;

	i915_retire_requests(i915);
	GEM_BUG_ON(i915->gt.active_requests);
4.19.0-26.xxx is not our upstream kernel. Note your dmesg extract lacks a couple of lines of context, and that if you enable the DEBUG_GEM you probably also want TRACE_GEM to figure out how it got itself into such a state. But first and foremost, focus on a known kernel.
Thanks for the review. But can you tell me what a known kernel is? 4.19 is the current LTS kernel ... TRACE_GEM is "y" in my case too. The log I pasted is everything related to i915; I don't know which part you think is missing context, so please point it out. Thanks! Alek
Please try to reproduce the error using drm-tip (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e log_buf_len=4M, and if the problem persists attach the full dmesg from boot. On what system was this issue seen?
It is an Intel A3960 platform; let me figure out how to trigger the issue with a drm-tip kernel. The problem is that, to get the platform fully running, other patches need to be ported over... Thanks, Alek
So based on info that is APL / BXT?
(In reply to Jani Saarinen from comment #5)
> So based on info that is APL / BXT?

Yes, correct!
I have reproduced this issue and found the root cause, described below.

i915_gem_wait_for_idle() waits for all requests to complete and then calls i915_retire_requests() to retire them. It assumes that active_requests will be zero afterwards. i915_retire_requests() retires all requests on the active rings.

Unfortunately, active_requests is incremented in i915_request_alloc() and decremented in i915_request_retire(), but a request is only added to the active rings in i915_request_add(). If i915_gem_wait_for_idle() is called between i915_request_alloc() and i915_request_add(), that request is not retired, so active_requests is still non-zero at the end and the GEM_BUG_ON fires.

Normally, i915_request_alloc() and i915_request_add() are called back to back with drm.struct_mutex held. But intel_vgpu_create_workload() pre-allocates the request and calls i915_request_add() later, in the workload thread, as a performance optimization. That opens the window for the issue above.

I have submitted an RFC patch, and the discussion continues on the mailing list. Please refer to https://lkml.org/lkml/2018/12/20/78
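To make the ordering problem concrete, here is a minimal user-space sketch of the accounting described in this comment. It is not kernel code: every `model_*` name, the fixed-size ring array, and the two helper sequences are hypothetical stand-ins chosen for illustration, and locking and real request completion are left out entirely. It only models the counter being bumped at allocation time while retirement walks the list that is populated at add time.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_RING_CAP 8

/* Hypothetical stand-in for the i915->gt bookkeeping; not the real struct. */
struct model_gt {
	int active_requests;          /* bumped at alloc, dropped at retire */
	int ring[MODEL_RING_CAP];     /* ids of requests visible to retirement */
	size_t ring_len;
};

/* Models the alloc step: the request is counted but not yet on any ring. */
static void model_request_alloc(struct model_gt *gt)
{
	gt->active_requests++;
}

/* Models the add step: only now does the request become visible to retire. */
static void model_request_add(struct model_gt *gt, int id)
{
	assert(gt->ring_len < MODEL_RING_CAP);
	gt->ring[gt->ring_len++] = id;
}

/* Models retirement: walks the ring only, so unadded requests are skipped. */
static void model_retire_requests(struct model_gt *gt)
{
	gt->active_requests -= (int)gt->ring_len;
	gt->ring_len = 0;
}

/*
 * Models the idle wait: retire everything, then report what is left.
 * A non-zero return corresponds to the GEM_BUG_ON condition.
 */
static int model_wait_for_idle(struct model_gt *gt)
{
	model_retire_requests(gt);
	return gt->active_requests;
}

/* alloc and add back to back, then idle: the counter drains to zero. */
static int model_normal_sequence(void)
{
	struct model_gt gt = { 0 };

	model_request_alloc(&gt);
	model_request_add(&gt, 1);
	return model_wait_for_idle(&gt);
}

/* idle wait lands between alloc and add: one request is left counted. */
static int model_racy_sequence(void)
{
	struct model_gt gt = { 0 };
	int leftover;

	model_request_alloc(&gt);           /* pre-allocated request */
	leftover = model_wait_for_idle(&gt); /* e.g. shrinker runs here */
	model_request_add(&gt, 1);          /* added later by another thread */
	return leftover;
}
```

In this model, `model_normal_sequence()` leaves nothing behind, while `model_racy_sequence()` reports one leftover request at the point of the idle wait, which is the state the assertion at i915_gem.c:4702 rejects.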