109005 – i915 slab shrink cause a panic

Summary: i915 slab shrink cause a panic

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/iGVT-g
Version:	XOrg git
Hardware:	x86 (IA32) Linux (All)

Importance:	medium critical
Assignee:	Terrence Xu
QA Contact:	Terrence Xu

URL:
Whiteboard:	Triaged
Keywords:

Depends on:
Blocks:

Reported:	2018-12-11 00:54 UTC by Alek Du
Modified:	2019-01-15 17:50 UTC
CC List:	1 user

See Also:
i915 platform:	BXT
i915 features:

Description Alek Du 2018-12-11 00:54:15 UTC

During a system stress test, when memory is under pressure, we hit the following kernel BUG:

[2018-11-28 23:17:54] [12278.310417] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:4702!
[2018-11-28 23:17:54] [12278.310802] invalid opcode: 0000 [#1] PREEMPT SMP
[2018-11-28 23:17:54] [12278.311012] CPU: 0 PID: 61 Comm: kswapd0 Tainted: G     U  WC        4.19.0-26.iot-lts2018-sos #1
[2018-11-28 23:17:54] [12278.311393] RIP: 0010:i915_gem_wait_for_idle.part.78.cold.114+0x45/0x47
[2018-11-28 23:17:54] [12278.311675] Code: 7b 8b ae ff 48 8b 35 e6 92 3c 01 49 c7 c0 af 48 55 a9 b9 5e 12 00 00 48 c7 c2 50 7a 0b a9 48 c7 c7 f4 e6 60 a8 e8 37 38 b6 ff <0f> 0b 48 c7 c1 a8 59 55 a9 ba b8 12 00 00 48 c7 c6 20 7a 0b a9 48
[2018-11-28 23:17:55] [12278.312447] RSP: 0018:ffff8e31acd8bbb8 EFLAGS: 00010246
[2018-11-28 23:17:55] [12278.312673] RAX: 000000000000000e RBX: 000000000000000a RCX: 0000000000000000
[2018-11-28 23:17:55] [12278.312971] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff8e31ae841400
[2018-11-28 23:17:55] [12278.313268] RBP: ffff8e31acea8340 R08: 0000000001416578 R09: ffff8e31aea15000
[2018-11-28 23:17:55] [12278.313566] R10: ffff8e31ae807100 R11: ffff8e31ae841400 R12: ffff8e31acea0000
[2018-11-28 23:17:55] [12278.313863] R13: 00000b2ab1d38ed0 R14: 0000000000000000 R15: ffff8e31acd8bd70
[2018-11-28 23:17:55] [12278.314162] FS:  0000000000000000(0000) GS:ffff8e31afa00000(0000) knlGS:0000000000000000
[2018-11-28 23:17:55] [12278.314499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[2018-11-28 23:17:55] [12278.314741] CR2: 00007ff94948f000 CR3: 0000000226813000 CR4: 00000000003406f0
[2018-11-28 23:17:55] [12278.315039] Call Trace:
[2018-11-28 23:17:55] [12278.315162]  i915_gem_shrink+0x3b7/0x4b0
[2018-11-28 23:17:55] [12278.315340]  i915_gem_shrinker_scan+0x104/0x130
[2018-11-28 23:17:55] [12278.315537]  do_shrink_slab+0x12c/0x2c0
[2018-11-28 23:17:55] [12278.315706]  shrink_slab+0x225/0x2c0
[2018-11-28 23:17:55] [12278.315864]  shrink_node+0xe4/0x430
[2018-11-28 23:17:55] [12278.316018]  kswapd+0x3ce/0x730
[2018-11-28 23:17:55] [12278.316161]  ? mem_cgroup_shrink_node+0x1a0/0x1a0
[2018-11-28 23:17:55] [12278.316365]  kthread+0x11e/0x140
[2018-11-28 23:17:55] [12278.316508]  ? kthread_create_worker_on_cpu+0x70/0x70
[2018-11-28 23:17:55] [12278.316727]  ret_from_fork+0x3a/0x50
[2018-11-28 23:17:55] [12278.316884] Modules linked in: igb_avb(C) xhci_pci xhci_hcd dca ici_isys_mod ipu4_acpi intel_ipu4_isys_csslib intel_ipu4_psys intel_ipu4_psys_csslib intel_ipu4_mmu intel_ipu4 iova crlmodule_lite

More specifically, this is line 4702 of drivers/gpu/drm/i915/i915_gem.c (the GEM_BUG_ON below):

	err = wait_for_engines(i915);
	if (err)
		return err;

	i915_retire_requests(i915);
	GEM_BUG_ON(i915->gt.active_requests);

Comment 1 Chris Wilson 2018-12-11 09:17:18 UTC

4.19.0-26.xxx is not our upstream kernel. Note that your dmesg extract lacks a couple of lines of context, and that if you enable DEBUG_GEM you probably also want TRACE_GEM to figure out how it got itself into such a state.

But first and foremost, focus on a known kernel.

Comment 2 Alek Du 2018-12-11 09:31:38 UTC

Thanks for the review. But can you tell me what counts as a known kernel? 4.19 is the current LTS kernel ...

TRACE_GEM is also set to "y" in my configuration.

Everything in the log I pasted is related to i915. I don't know which part you think is missing context; please point it out, thanks!

Alek

Comment 3 Jani Saarinen 2018-12-12 07:57:48 UTC

Please try to reproduce the error using drm-tip (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e log_buf_len=4M, and if the problem persists attach the full dmesg from boot.

On what system was this issue seen?

Comment 4 Alek Du 2018-12-12 09:11:11 UTC

It is an Intel A3960 platform. Let me figure out how to trigger the issue with a drm-tip kernel. The problem is that, for the platform to run fully, other patches need to be ported over...

Thanks,
Alek

Comment 5 Jani Saarinen 2018-12-12 09:26:11 UTC

So based on info that is APL / BXT?

Comment 6 Alek Du 2018-12-12 09:28:45 UTC

(In reply to Jani Saarinen from comment #5)
> So based on info that is APL / BXT?

Yes, correct!

Comment 7 Yang Bin 2018-12-20 08:57:11 UTC

I have reproduced this issue and found the root cause, described below.

i915_gem_wait_for_idle() waits for all requests to complete and then
calls i915_retire_requests() to retire them. It assumes that
active_requests is zero afterwards.

i915_retire_requests() retires only the requests on the active rings.
However, active_requests is incremented in i915_request_alloc() and
decremented in i915_request_retire(), whereas a request is not added
to the active rings until i915_request_add().

If i915_gem_wait_for_idle() runs between i915_request_alloc() and
i915_request_add(), that request is not retired, so active_requests
is still non-zero at the end and the GEM_BUG_ON() fires.

Normally, i915_request_alloc() and i915_request_add() are called
back to back with drm.struct_mutex held. However,
intel_vgpu_create_workload() pre-allocates the request, and
i915_request_add() is called later, in the workload thread, as a
performance optimization. That opens the window in which the above
issue can be triggered.

I have submitted an RFC patch, and the discussion continues on the mailing list. Please refer to https://lkml.org/lkml/2018/12/20/78
