Bug 106832 - [i915]Kernel panic occurs after ECHO 15 to /sys/kernel/debug/dri/0/i915_wedged
Summary: [i915]Kernel panic occurs after ECHO 15 to /sys/kernel/debug/dri/0/i915_wedged
Status: CLOSED INVALID
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other Linux (All)
Importance: medium major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-06-06 05:42 UTC by Owen Zhang
Modified: 2018-07-09 21:29 UTC
CC List: 2 users

See Also:
i915 platform:
i915 features:


Attachments
for reproducer (20.28 MB, application/x-gzip)
2018-06-06 05:42 UTC, Owen Zhang
git bisect log (3.07 KB, text/plain)
2018-06-06 05:43 UTC, Owen Zhang
fix patch by bisect (5.08 KB, text/plain)
2018-06-06 05:43 UTC, Owen Zhang
console log (15.93 KB, text/plain)
2018-06-06 05:44 UTC, Owen Zhang
draft one fix patch on kernel 4.14 (4.95 KB, patch)
2018-07-06 02:26 UTC, Owen Zhang

Description Owen Zhang 2018-06-06 05:42:17 UTC
Created attachment 140042 [details]
for reproducer

1. The console message:
[  489.724739] [drm] GPU HANG: ecode 9:0:0xfef77ffe, reason: Manually setting wedged to 15, action: reset
[  489.733997] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  489.739675] i915 0000:00:02.0: Resetting bcs0 after gpu hang
[  489.745400] i915 0000:00:02.0: Resetting vcs0 after gpu hang
[  489.751182] i915 0000:00:02.0: Resetting chip after gpu hang
[  489.756901] [drm] RC6 on
[  497.695484] i915 0000:00:02.0: Resetting vcs0 after gpu hang
[  497.701381] BUG: unable to handle kernel NULL pointer dereference at 0000000000000070
[  497.709173] IP: reset_common_ring+0x23/0x140 [i915]
[  497.714012] PGD 0 P4D 0
[  497.716526] Oops: 0000 [#1] SMP PTI
[  497.719988] Modules linked in: fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter snd_hda_codec_hdmi rfkill snd_hda_codec_realtek intel_rapl snd_hda_codec_generic x86_pkg_temp_thermal snd_hda_intel intel_powerclamp snd_hda_codec coretemp kvm_intel snd_hda_core kvm snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer irqbypass mei_me snd crct10dif_pclmul dell_wmi crc32_pclmul
[  497.790676]  ghash_clmulni_intel pcbc iTCO_wdt dell_smbios iTCO_vendor_support mei aesni_intel crypto_simd sparse_keymap wmi_bmof dcdbas glue_helper cryptd shpchp soundcore nfsd sg i2c_i801 pcspkr acpi_pad auth_rpcgss wmi nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm e1000e ahci libahci ptp libata i2c_core pps_core crc32c_intel serio_raw video dm_mirror dm_region_hash dm_log dm_mod
[  497.832418] CPU: 4 PID: 308 Comm: kworker/4:2 Not tainted 4.14.20-mss-pv5+ #13
[  497.839583] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.1.1 10/07/2015
[  497.846932] Workqueue: events_long i915_hangcheck_elapsed [i915]
[  497.852889] task: ffff9e931c3e4740 task.stack: ffffb9ec413d4000
[  497.858773] RIP: 0010:reset_common_ring+0x23/0x140 [i915]
[  497.864128] RSP: 0018:ffffb9ec413d7c50 EFLAGS: 00010246
[  497.869312] RAX: ffffffffc03e9290 RBX: ffff9e9323745200 RCX: 0000000000000006
[  497.876392] RDX: 0000000000000000 RSI: ffff9e9323745200 RDI: 0000000000000000
[  497.883469] RBP: ffff9e9246426000 R08: 0000000000000000 R09: 00000000000009f9
[  497.890547] R10: 0000000000000007 R11: 0000000000000000 R12: ffff9e9323745200
[  497.897628] R13: 00000000ffffffff R14: ffff9e931a09cdc8 R15: ffff9e9246426000
[  497.904708] FS:  0000000000000000(0000) GS:ffff9e9331d00000(0000) knlGS:0000000000000000
[  497.912734] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  497.918435] CR2: 0000000000000070 CR3: 00000001ee00a005 CR4: 00000000003606e0
[  497.925515] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  497.932594] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  497.939671] Call Trace:
[  497.942105]  i915_reset_engine+0x5e/0xf0 [i915]
[  497.946606]  i915_handle_error+0x21e/0x3e0 [i915]
[  497.951274]  ? vsnprintf+0x203/0x4d0
[  497.954820]  ? vscnprintf+0x9/0x20
[  497.958193]  ? scnprintf+0x49/0x70
[  497.961575]  hangcheck_declare_hang+0xd8/0x110 [i915]
[  497.966595]  ? fwtable_read32+0x83/0x190 [i915]
[  497.971095]  i915_hangcheck_elapsed+0x2cf/0x380 [i915]
[  497.976196]  process_one_work+0x141/0x340
[  497.980183]  worker_thread+0x47/0x3e0
[  497.983820]  kthread+0xfc/0x130
[  497.986943]  ? rescuer_thread+0x380/0x380
[  497.990928]  ? kthread_park+0x60/0x60
[  497.994566]  ? do_syscall_64+0x6f/0x1a0
[  497.998378]  ? SyS_exit_group+0x10/0x10
[  498.002190]  ret_from_fork+0x35/0x40
[  498.005739] Code: 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 48 85 f6 55 48 89 fd 53 48 89 f3 0f 84 db 00 00 00 48 8b bf 88 03 00 00 48 83 e7 fc <48> 8b 47 70 48 39 46 70 74 29 48 85 ff 74 0b f0 ff 0f 0f 88 93
[  498.024481] RIP: reset_common_ring+0x23/0x140 [i915] RSP: ffffb9ec413d7c50
[  498.031301] CR2: 0000000000000070
[  498.034592] ---[ end trace e7c5283a77cf3e17 ]---
[  498.039170] Kernel panic - not syncing: Fatal exception
[  498.044399] Kernel Offset: 0x7000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  498.055011] ---[ end Kernel panic - not syncing: Fatal exception
[  498.060975] ------------[ cut here ]------------
[  498.065557] WARNING: CPU: 4 PID: 308 at kernel/sched/core.c:1178 set_task_cpu+0x184/0x190
[  498.073670] Modules linked in: fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter snd_hda_codec_hdmi rfkill snd_hda_codec_realtek intel_rapl snd_hda_codec_generic x86_pkg_temp_thermal snd_hda_intel intel_powerclamp snd_hda_codec coretemp kvm_intel snd_hda_core kvm snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer irqbypass mei_me snd crct10dif_pclmul dell_wmi crc32_pclmul
[  498.144270]  ghash_clmulni_intel pcbc iTCO_wdt dell_smbios iTCO_vendor_support mei aesni_intel crypto_simd sparse_keymap wmi_bmof dcdbas glue_helper cryptd shpchp soundcore nfsd sg i2c_i801 pcspkr acpi_pad auth_rpcgss wmi nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm e1000e ahci libahci ptp libata i2c_core pps_core crc32c_intel serio_raw video dm_mirror dm_region_hash dm_log dm_mod
[  498.185999] CPU: 4 PID: 308 Comm: kworker/4:2 Tainted: G      D         4.14.20-mss-pv5+ #13
[  498.194369] Hardware name: Dell Inc. OptiPlex 5040/0R790T, BIOS 1.1.1 10/07/2015
[  498.201713] Workqueue: events_long i915_hangcheck_elapsed [i915]
[  498.207673] task: ffff9e931c3e4740 task.stack: ffffb9ec413d4000
[  498.213546] RIP: 0010:set_task_cpu+0x184/0x190
[  498.217961] RSP: 0018:ffff9e9331d03cf8 EFLAGS: 00010046
[  498.223153] RAX: 0000000000000200 RBX: ffff9e93184b0000 RCX: 0000000000000001
[  498.230248] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff9e93184b0000
[  498.237339] RBP: 0000000000022600 R08: 00000000000000ff R09: 0000000000000000
[  498.244427] R10: 000000005b0e7cd0 R11: 000000002276b083 R12: 0000000000000000
[  498.251507] R13: 0000000000000000 R14: 0000000000000046 R15: 0000000000000000
[  498.258588] FS:  0000000000000000(0000) GS:ffff9e9331d00000(0000) knlGS:0000000000000000
[  498.266616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  498.272317] CR2: 0000000000000070 CR3: 00000001ee00a005 CR4: 00000000003606e0
[  498.279398] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  498.286478] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  498.293556] Call Trace:
[  498.295980]  <IRQ>
[  498.297974]  try_to_wake_up+0x161/0x440
[  498.301778]  __wake_up_common+0x8a/0x150
[  498.305669]  ep_poll_callback+0xc9/0x2e0
[  498.309560]  __wake_up_common+0x8a/0x150
[  498.313450]  __wake_up_common_lock+0x7a/0xc0
[  498.317686]  irq_work_run_list+0x48/0x70
[  498.321576]  ? tick_sched_do_timer+0x60/0x60
[  498.325813]  update_process_times+0x3b/0x50
[  498.329962]  tick_sched_handle+0x26/0x60
[  498.333853]  tick_sched_timer+0x34/0x70
[  498.337659]  __hrtimer_run_queues+0xdc/0x220
[  498.341896]  hrtimer_interrupt+0x99/0x190
[  498.345872]  smp_apic_timer_interrupt+0x5a/0x120
[  498.350452]  apic_timer_interrupt+0xa2/0xb0
[  498.354601]  </IRQ>
[  498.356683] RIP: 0010:panic+0x1fa/0x23c
[  498.360488] RSP: 0018:ffffb9ec413d7a10 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  498.367999] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000006
[  498.375080] RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff9e9331d16970
[  498.382161] RBP: ffffb9ec413d7a80 R08: 0000000000000000 R09: 0000000000000a27
[  498.389238] R10: 0000000000000000 R11: ffffb9ec413d7780 R12: ffffffff88e44d50
[  498.396317] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[  498.403396]  oops_end+0xaf/0xc0
[  498.406510]  no_context+0x1b3/0x400
[  498.409971]  __do_page_fault+0x97/0x4d0
[  498.413778]  do_page_fault+0x33/0x120
[  498.417410]  page_fault+0x2c/0x60
[  498.420706] RIP: 0010:reset_common_ring+0x23/0x140 [i915]
[  498.426063] RSP: 0018:ffffb9ec413d7c50 EFLAGS: 00010246
[  498.431245] RAX: ffffffffc03e9290 RBX: ffff9e9323745200 RCX: 0000000000000006
[  498.438326] RDX: 0000000000000000 RSI: ffff9e9323745200 RDI: 0000000000000000
[  498.445404] RBP: ffff9e9246426000 R08: 0000000000000000 R09: 00000000000009f9
[  498.452485] R10: 0000000000000007 R11: 0000000000000000 R12: ffff9e9323745200
[  498.459579] R13: 00000000ffffffff R14: ffff9e931a09cdc8 R15: ffff9e9246426000
[  498.466682]  ? port_assign+0x60/0x60 [i915]
[  498.470844]  i915_reset_engine+0x5e/0xf0 [i915]
[  498.475355]  i915_handle_error+0x21e/0x3e0 [i915]
[  498.480034]  ? vsnprintf+0x203/0x4d0
[  498.483582]  ? vscnprintf+0x9/0x20
[  498.486955]  ? scnprintf+0x49/0x70
[  498.490338]  hangcheck_declare_hang+0xd8/0x110 [i915]
[  498.495361]  ? fwtable_read32+0x83/0x190 [i915]
[  498.499864]  i915_hangcheck_elapsed+0x2cf/0x380 [i915]
[  498.504963]  process_one_work+0x141/0x340
[  498.508941]  worker_thread+0x47/0x3e0
[  498.512575]  kthread+0xfc/0x130
[  498.515690]  ? rescuer_thread+0x380/0x380
[  498.519667]  ? kthread_park+0x60/0x60
[  498.523301]  ? do_syscall_64+0x6f/0x1a0
[  498.527106]  ? SyS_exit_group+0x10/0x10
[  498.530912]  ret_from_fork+0x35/0x40
[  498.534458] Code: ff 80 8b ac 08 00 00 04 e9 2b ff ff ff 0f ff e9 c7 fe ff ff f7 83 84 00 00 00 fd ff ff ff 0f 84 d1 fe ff ff 0f ff e9 ca fe ff ff <0f> ff e9 d9 fe ff ff 0f 1f 44 00 00 0f 1f 44 00 00 41 55 49 89
[  498.553187] ---[ end trace e7c5283a77cf3e18 ]---
[  498.557769] sched: Unexpected reschedule of offline CPU#0!

	2. This kernel panic reproduces only on 4.14 LTS. It does not reproduce on drm-tip or on the latest kernel.org kernel.
I also ran git bisect on drm-tip to find the commit that fixes it; the bisect log and the resulting fix patch are attached.
Please note that the bisect terms are inverted, because the search was for a fix rather than a regression:
"good" means the issue still reproduces.
"bad" means the issue no longer reproduces, i.e. it has been fixed.

	3. The commit message of the fix in drm-tip:
commit 221ab9719bf33ad2984928d2afb20988d652a289
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Sep 16 21:44:14 2017 +0100

    drm/i915/execlists: Unwind incomplete requests on resets

    Given the mechanism to unwind and replay requests (designed to support
    preemption), we have an alternative to the current method of
    resubmitting the ELSP upon reset. Resubmitting ELSP turns out to be more
    complicated than expected, due to having to handle lost context-switch
    interrupts and so guessing what ELSP we need to resubmit later. Instead,
    by unwinding the requests and clearing the ELSP tracking entirely, we
    can then just dequeue the first pair of ready requests after resetting,
    using the normal submission procedure.

    Currently, the unwound requests have maximum priority and so are
    guaranteed to be resubmitted upon resume. If we are lucky, we may be
    able to coalesce a new request on top!

    Suggested-by: Michał Winiarski <michal.winiarski@intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20170916204414.32762-4-chris@chris-wilson.co.uk
    Reviewed-by: Michał Winiarski <michal.winiarski@intel.com>

	4. The fix commit on kernel.org:
 https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/commit/drivers/gpu/drm/i915/intel_lrc.c?h=v4.16.13&id=221ab9719bf33ad2984928d2afb20988d652a289

	5. Steps to reproduce:
	1) Build this stack: https://software.intel.com/en-us/articles/build-and-debug-open-source-media-stack
	2) Run the reproducer in one terminal:
	./repro.sh
	3) Meanwhile, run the following command in another terminal:
	echo 15 > /sys/kernel/debug/dri/0/i915_wedged
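
The oops in item 1 can be cross-checked against its own "Code:" bytes. The marked instruction `48 8b 47 70` decodes to `mov 0x70(%rdi),%rax`, RDI is 0 and CR2 is 0x70, so the fault is a read of a member at offset 0x70 through a NULL pointer. The instruction just before it, `48 83 e7 fc`, clears the two low bits of that pointer, which is consistent with a tagged port pointer being unpacked before `->ctx` is read in reset_common_ring(). The following is only a toy model in plain C to illustrate that arithmetic, not driver code: `toy_request`, `toy_port_request` and `toy_fault_address` are hypothetical names, and the padding is an assumption chosen so that `ctx` lands at offset 0x70.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a request. This is NOT the real request layout;
 * the padding is an assumption so that ctx sits at offset 0x70,
 * matching the displacement in the faulting instruction. */
struct toy_request {
	unsigned char pad[0x70];
	void *ctx;
};

static_assert(offsetof(struct toy_request, ctx) == 0x70,
	      "ctx must sit at offset 0x70 in this model");

/* Strip the two low tag bits from a packed port value, as the
 * "and $0xfffffffffffffffc,%rdi" before the fault does. */
static uintptr_t toy_port_request(uintptr_t packed)
{
	return packed & ~(uintptr_t)3;
}

/* Address the CPU would touch when reading ->ctx via the port value. */
static uintptr_t toy_fault_address(uintptr_t packed)
{
	return toy_port_request(packed) + offsetof(struct toy_request, ctx);
}
```

With an empty port (a packed value of 0, or one holding only tag bits), `toy_fault_address()` yields 0x70, the same value the oops reports in CR2.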
Comment 1 Owen Zhang 2018-06-06 05:43:03 UTC
Created attachment 140043 [details]
git bisect log
Comment 2 Owen Zhang 2018-06-06 05:43:31 UTC
Created attachment 140044 [details]
fix patch by bisect
Comment 3 Owen Zhang 2018-06-06 05:44:42 UTC
Created attachment 140045 [details]
console log
Comment 4 Owen Zhang 2018-06-06 05:46:45 UTC
We have bisected to the fix, but that patch cannot be backported to 4.14 LTS directly because it depends on other changes. We therefore need a separate fix for 4.14 LTS. Thanks very much.
Comment 5 Chris Wilson 2018-06-06 07:35:49 UTC
Simpler patch; remove debugfs/i915_wedged.
Comment 6 Joonas Lahtinen 2018-06-06 10:46:57 UTC
As discussed, the resulting logs need to be from an actual real-world use case, so that we can make sure we're working to fix the actual issue, not just what appears to be a similar synthetic use case through debugfs.

And that use case needs to be valid for the LTS to consider backporting non-trivial patches. At the time of any kernel release, we can't really predict all the future use cases but expect testing to take place during the development window and release candidates. For LTS the promise is to keep supporting the existing use cases, per the name (long-term support). It's not a long-term development kernel, where we introduce stuff that had never been thought of before.

The media solution referred to here did not even exist in Open Source at the time of 4.14 development, so it should not come as a surprise that a kernel preceding it in time does not support it out of the box.

We continually improve the upstream driver, but we can't duplicate all the improvements into the LTS/stable kernels just because "they might help" some userspace that comes in the future. That would make the LTS kernel equal to drm-tip and take away the benefits of the LTS release for the users who have provided testing during the development and release candidates, and are relying on the stability of that tree for the use case they have. That is very much the danger with patches like the one bisected here, which literally changes how the driver interacts with the hardware.

But, after seeing the real world use case, we can still actually *try* to see what can be done to mitigate the issue with minimal changes.

Thus, the reporter has been instructed to re-open a bug with a real use case and description of that issue.
Comment 7 Owen Zhang 2018-07-06 02:26:41 UTC
Created attachment 140479 [details] [review]
draft one fix patch on kernel 4.14

Hi Joonas and all,

Could you please take a look at this fix for 4.14 LTS?
We need a fix for our MSS release, so we made a patch based on 4.14. These changes avoid the kernel panic.
Please share any comments and suggestions. Thanks very much.

The changes are copied below; I would appreciate your comments.

---
 drivers/gpu/drm/i915/i915_gem.c  |  6 ++++--
 drivers/gpu/drm/i915/intel_lrc.c | 17 +++++++++++------
 2 files changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 877e3b6..45f8130 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -3058,8 +3058,10 @@ static void engine_set_wedged(struct intel_engine_cs *engine)
 		for (n = 0; n < ARRAY_SIZE(engine->execlist_port); n++) {
 			struct drm_i915_gem_request *rq = port_request(&port[n]);
 
-			intel_engine_context_out(rq->engine);
-			i915_gem_request_put(rq);
+			if (rq) {
+				intel_engine_context_out(rq->engine);
+				i915_gem_request_put(rq);
+			}
 		}
 		memset(engine->execlist_port, 0, sizeof(engine->execlist_port));
 		engine->execlist_queue = RB_ROOT;
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 50e9641..e620b4d 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1411,13 +1411,18 @@ static void reset_common_ring(struct intel_engine_cs *engine,
 		return;
 	}
 
-	if (request->ctx != port_request(port)->ctx) {
-		i915_gem_request_put(port_request(port));
-		port[0] = port[1];
-		memset(&port[1], 0, sizeof(port[1]));
-	}
+	/* The request in the ELSP may already have been processed by the
+	 * time this forced GPU reset runs, leaving the port empty.
+	 */
+	if (port_request(port)) {
+		if (request->ctx != port_request(port)->ctx) {
+			i915_gem_request_put(port_request(port));
+			port[0] = port[1];
+			memset(&port[1], 0, sizeof(port[1]));
+		}
 
-	GEM_BUG_ON(request->ctx != port_request(port)->ctx);
+		GEM_BUG_ON(request->ctx != port_request(port)->ctx);
+	}
 
 	/* If the request was innocent, we leave the request in the ELSP
 	 * and will try to replay it on restarting. The context image may
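
The guard that the draft patch adds can be shown in isolation. The sketch below is a self-contained toy model in plain C under stated assumptions, not the driver code: `demo_ctx`, `demo_request`, `demo_port` and `demo_reset_ports` are hypothetical names, a plain counter stands in for request reference counting, and the tag bits of the real port pointer are left out. It only illustrates checking the first port for NULL before comparing contexts and shifting the second port down.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the context, request and port objects. */
struct demo_ctx { int id; };
struct demo_request { struct demo_ctx *ctx; int refcount; };
struct demo_port { struct demo_request *rq; };

enum demo_reset_result { DEMO_PORT_EMPTY, DEMO_PORT_KEPT, DEMO_PORT_SHIFTED };

/* Decide what to do with the two submission ports when resetting for
 * the request "hung". Without the NULL check, first->ctx would be read
 * through a NULL pointer whenever the first port is already empty. */
static enum demo_reset_result
demo_reset_ports(struct demo_port port[2], const struct demo_request *hung)
{
	struct demo_request *first;

	assert(port != NULL && hung != NULL);

	first = port[0].rq;
	if (first == NULL)
		return DEMO_PORT_EMPTY;		/* nothing in flight to fix up */

	if (hung->ctx == first->ctx)
		return DEMO_PORT_KEPT;		/* first port already matches */

	/* Drop the stale first request and promote the second port. */
	first->refcount--;
	port[0] = port[1];
	port[1].rq = NULL;
	return DEMO_PORT_SHIFTED;
}
```

Returning a distinct result for the empty-port case keeps that path visible to the caller. Whether silently skipping an empty port is the right behavior for the real driver is exactly what the reviewers question in the comments that follow.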
Comment 8 Rodrigo Vivi 2018-07-06 17:15:02 UTC
I'm afraid Chris already gave the right solution for an LTS in comment 5.
Everything else is duct tape, not suitable for an LTS imho.
Comment 9 Owen Zhang 2018-07-07 01:15:38 UTC
(In reply to Rodrigo Vivi from comment #8)
> I'm afraid Chris already gave the right solution for an LTS in comment 5.
> Everything else is duct tape, not suitable for an LTS imho.

Thanks very much, but I haven't found Chris's patch in the LTS repo. Could you tell me how I can get it? Thanks a lot.
Comment 10 Rodrigo Vivi 2018-07-09 21:29:44 UTC
He gave a suggestion for a patch, not an actual patch.

