92545 – [BSW] GPU Hang leads to sporadic kernel crashes

Summary: [BSW] GPU Hang leads to sporadic kernel crashes

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	x86-64 (AMD64) other

Importance:	medium major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2015-10-19 19:27 UTC by Dhinakaran Pandiyan
Modified:	2018-01-05 17:05 UTC
CC List:	2 users

See Also:
i915 platform:	BSW/CHT
i915 features:	GPU hang

Attachments
dmesg (109.93 KB, text/plain) 2015-10-19 19:28 UTC, Dhinakaran Pandiyan

Description Dhinakaran Pandiyan 2015-10-19 19:27:10 UTC

The kernel crashes after a GPU reset. GPU hangs are frequent and happen at boot time, but only occasionally does a hang lead to a kernel crash. This has been observed on Chrome OS with a 3.18 kernel that carries i915 backports.

I believe the NULL pointer access happens at
I915_WRITE(DSPSURF(intel_crtc->plane), intel_crtc->unpin_work->gtt_offset);
in intel_display.c:ilk_do_mmio_flip.
If a reset is in progress, the call sequence
intel_finish_reset -> intel_complete_page_flips -> intel_finish_page_flip_plane -> do_intel_finish_page_flip -> page_flip_completed
might set intel_crtc->unpin_work = NULL before the MMIO flip worker dereferences it.
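
The suspected hazard can be modelled in user space. The sketch below is only an illustration of the pattern described above, not the driver code and not the upstream fix: `struct flip_work`, `struct crtc_model`, `model_complete_flip` and `model_mmio_flip` are hypothetical names, and the use of C11 atomics is an assumption made to keep the example self-contained. It shows the flip path reading the pointer once and bailing out if the reset path has already cleared it. It does not address the lifetime of the work item itself, which the real driver would also have to handle.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the pending flip; only the field used here. */
struct flip_work {
    uint32_t gtt_offset;
};

/* Hypothetical stand-in for intel_crtc; only the pointer that is raced on. */
struct crtc_model {
    _Atomic(struct flip_work *) unpin_work;
};

/* Models the reset path (page_flip_completed) clearing the pointer. */
void model_complete_flip(struct crtc_model *crtc)
{
    atomic_store(&crtc->unpin_work, NULL);
}

/*
 * Models the MMIO flip worker. The pointer is loaded exactly once, so a
 * concurrent clear cannot turn a successful check into a NULL dereference.
 * Returns true and stores the surface offset if a flip was still pending.
 */
bool model_mmio_flip(struct crtc_model *crtc, uint32_t *surf_out)
{
    struct flip_work *work = atomic_load(&crtc->unpin_work);

    if (work == NULL)
        return false;   /* reset already completed the flip */

    *surf_out = work->gtt_offset;
    return true;
}
```

Calling `model_mmio_flip` after `model_complete_flip` returns false instead of faulting, which is the behaviour the unguarded `intel_crtc->unpin_work->gtt_offset` expression lacks.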

We need some help to debug this crash.

<6>[    6.744129] [drm] stuck on render ring
<6>[    6.766343] [drm] GPU HANG: ecode 8:0:0x2efe5dbc, reason: Ring hung, action: reset
<6>[    6.766356] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<6>[    6.766367] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<6>[    6.766378] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<6>[    6.766389] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<6>[    6.766400] [drm] GPU crash dump saved to /sys/class/drm/card0/error
<5>[    6.769207] drm/i915: Resetting chip after gpu hang
<6>[   12.739947] [drm] stuck on render ring
<6>[   12.765654] [drm] GPU HANG: ecode 8:0:0x86dffffd, in chrome [3652], reason: Ring hung, action: reset
<4>[   12.765733] ------------[ cut here ]------------
<4>[   12.765764] WARNING: CPU: 2 PID: 41 at /mnt/host/source/src/third_party/kernel/v3.18/drivers/gpu/drm/i915/intel_display.c:11277 intel_mmio_flip_work_func+0x6d/0x315()
<4>[   12.765787] WARN_ON(__i915_wait_request(mmio_flip->req, mmio_flip->crtc->reset_counter, false, NULL, &mmio_flip->i915->rps.mmioflips))
<4>[   12.765805] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 cros_ec_sensors ip6table_filter cros_ec_sensors_core industrialio_triggered_buffer kfifo_buf ip6_tables iio_trig_sysfs industrialio iwlmvm iwl7000_mac80211 iwlwifi cfg80211 btusb btbcm btintel bluetooth smsc95xx usbnet mii uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core joydev ppp_async ppp_generic slhc tun
<4>[   12.765930] CPU: 2 PID: 41 Comm: kworker/2:1 Not tainted 3.18.0-06623-g902cb99 #1
<4>[   12.765944] Hardware name: GOOGLE Cyan, BIOS Google_Cyan.7287.57.2015_09_30_1147 09/30/2015
<4>[   12.765962] Workqueue: events intel_mmio_flip_work_func
<4>[   12.765974]  0000000000000000 000000004607d413 ffff88017a9bbcc8 ffffffff8d5f3d15
<4>[   12.765996]  0000000000000000 ffff88017a9bbd20 ffff88017a9bbd08 ffffffff8d03dfd9
<4>[   12.766016]  ffff88017a9bbcd8 ffffffff8d345524 ffff88017a97f000 ffff880072f2da00
<4>[   12.766037] Call Trace:
<4>[   12.766054]  [<ffffffff8d5f3d15>] ? dump_stack+0x46/0x58
<4>[   12.766070]  [<ffffffff8d03dfd9>] ? warn_slowpath_common+0x81/0x9b
<4>[   12.766085]  [<ffffffff8d345524>] ? intel_mmio_flip_work_func+0x6d/0x315
<4>[   12.766100]  [<ffffffff8d03e048>] ? warn_slowpath_fmt+0x55/0x6b
<4>[   12.766115]  [<ffffffff8d345524>] ? intel_mmio_flip_work_func+0x6d/0x315
<4>[   12.766133]  [<ffffffff8d05c849>] ? finish_task_switch+0x5b/0xba
<4>[   12.766149]  [<ffffffff8d051a1b>] ? process_one_work+0x175/0x2ab
<4>[   12.766163]  [<ffffffff8d052c95>] ? worker_thread+0x1fb/0x2ce
<4>[   12.766178]  [<ffffffff8d052a9a>] ? rescuer_thread+0x2d7/0x2d7
<4>[   12.766192]  [<ffffffff8d056863>] ? kthread+0x10e/0x116
<4>[   12.766207]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   12.766222]  [<ffffffff8d5f8bac>] ? ret_from_fork+0x7c/0xb0
<4>[   12.766237]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   12.766249] ---[ end trace 8d614c29c562a829 ]---
<5>[   12.767790] drm/i915: Resetting chip after gpu hang
<6>[   18.740012] [drm] stuck on render ring
<6>[   18.760304] [drm] GPU HANG: ecode 8:0:0x86dffffd, in chrome [3652], reason: Ring hung, action: reset
<4>[   18.760635] ------------[ cut here ]------------
<4>[   18.760665] WARNING: CPU: 0 PID: 1099 at /mnt/host/source/src/third_party/kernel/v3.18/drivers/gpu/drm/i915/intel_display.c:11277 intel_mmio_flip_work_func+0x6d/0x315()
<4>[   18.760688] WARN_ON(__i915_wait_request(mmio_flip->req, mmio_flip->crtc->reset_counter, false, NULL, &mmio_flip->i915->rps.mmioflips))
<4>[   18.760706] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 cros_ec_sensors ip6table_filter cros_ec_sensors_core industrialio_triggered_buffer kfifo_buf ip6_tables iio_trig_sysfs industrialio iwlmvm iwl7000_mac80211 iwlwifi cfg80211 btusb btbcm btintel bluetooth smsc95xx usbnet mii uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core joydev ppp_async ppp_generic slhc tun
<4>[   18.760823] CPU: 0 PID: 1099 Comm: kworker/0:2 Tainted: G        W      3.18.0-06623-g902cb99 #1
<4>[   18.760837] Hardware name: GOOGLE Cyan, BIOS Google_Cyan.7287.57.2015_09_30_1147 09/30/2015
<4>[   18.760854] Workqueue: events intel_mmio_flip_work_func
<4>[   18.760866]  0000000000000000 00000000e8ee967d ffff8801760b3cc8 ffffffff8d5f3d15
<4>[   18.760886]  0000000000000000 ffff8801760b3d20 ffff8801760b3d08 ffffffff8d03dfd9
<4>[   18.760905]  ffff8801760b3cd8 ffffffff8d345524 ffff8801799ceb40 ffff8801798a50c0
<4>[   18.760925] Call Trace:
<4>[   18.760940]  [<ffffffff8d5f3d15>] ? dump_stack+0x46/0x58
<4>[   18.760955]  [<ffffffff8d03dfd9>] ? warn_slowpath_common+0x81/0x9b
<4>[   18.760969]  [<ffffffff8d345524>] ? intel_mmio_flip_work_func+0x6d/0x315
<4>[   18.760983]  [<ffffffff8d03e048>] ? warn_slowpath_fmt+0x55/0x6b
<4>[   18.760997]  [<ffffffff8d345524>] ? intel_mmio_flip_work_func+0x6d/0x315
<4>[   18.761014]  [<ffffffff8d05c849>] ? finish_task_switch+0x5b/0xba
<4>[   18.761028]  [<ffffffff8d051a1b>] ? process_one_work+0x175/0x2ab
<4>[   18.761042]  [<ffffffff8d052c95>] ? worker_thread+0x1fb/0x2ce
<4>[   18.761055]  [<ffffffff8d052a9a>] ? rescuer_thread+0x2d7/0x2d7
<4>[   18.761069]  [<ffffffff8d056863>] ? kthread+0x10e/0x116
<4>[   18.761083]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   18.761096]  [<ffffffff8d5f8bac>] ? ret_from_fork+0x7c/0xb0
<4>[   18.761110]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   18.761121] ---[ end trace 8d614c29c562a82a ]---
<5>[   18.763443] drm/i915: Resetting chip after gpu hang
<1>[   18.769490] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
<1>[   18.769515] IP: [<ffffffff8d345716>] intel_mmio_flip_work_func+0x25f/0x315
<4>[   18.769536] PGD 0 
<4>[   18.769544] Oops: 0000 [#1] SMP 
<0>[   18.773130] gsmi: Log Shutdown Reason 0x03
<4>[   18.773140] Modules linked in: nf_conntrack_ipv6 nf_defrag_ipv6 cros_ec_sensors ip6table_filter cros_ec_sensors_core industrialio_triggered_buffer kfifo_buf ip6_tables iio_trig_sysfs industrialio iwlmvm iwl7000_mac80211 iwlwifi cfg80211 btusb btbcm btintel bluetooth smsc95xx usbnet mii uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core joydev ppp_async ppp_generic slhc tun
<4>[   18.773241] CPU: 0 PID: 1099 Comm: kworker/0:2 Tainted: G        W      3.18.0-06623-g902cb99 #1
<4>[   18.773255] Hardware name: GOOGLE Cyan, BIOS Google_Cyan.7287.57.2015_09_30_1147 09/30/2015
<4>[   18.773272] Workqueue: events intel_mmio_flip_work_func
<4>[   18.773283] task: ffff880179bfea80 ti: ffff8801760b0000 task.ti: ffff8801760b0000
<4>[   18.773296] RIP: 0010:[<ffffffff8d345716>]  [<ffffffff8d345716>] intel_mmio_flip_work_func+0x25f/0x315
<4>[   18.773314] RSP: 0018:ffff8801760b3d88  EFLAGS: 00010096
<4>[   18.773324] RAX: 0000000000000000 RBX: ffff88017b2b7000 RCX: 0000000000180000
<4>[   18.773337] RDX: 00000000001e1180 RSI: 0000000000000046 RDI: ffff88017a080000
<4>[   18.773349] RBP: ffff8801760b3de8 R08: 0000000000000001 R09: ffff88017b2b7000
<4>[   18.773361] R10: 0000000000000000 R11: 000000000000b910 R12: ffff88017a080000
<4>[   18.773373] R13: 00000000001f0180 R14: ffff8801798a50c0 R15: ffff8801741c5680
<4>[   18.773386] FS:  0000000000000000(0000) GS:ffff88017fc00000(0000) knlGS:0000000000000000
<4>[   18.773399] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>[   18.773410] CR2: 0000000000000048 CR3: 0000000077e06000 CR4: 00000000001007f0
<4>[   18.773422] Stack:
<4>[   18.773427]  ffff8801760b3db8 ffffffff8d05c849 ffff88017a99c000 ffff880078ac0b40
<4>[   18.773446]  000003dc77f04c10 00000000e8ee967d ffff8801760b3e28 ffff8801799ceb40
<4>[   18.773463]  ffff8801798a50c0 ffff88017fc11780 0000000000000000 ffff88017fc15b00
<4>[   18.773481] Call Trace:
<4>[   18.773495]  [<ffffffff8d05c849>] ? finish_task_switch+0x5b/0xba
<4>[   18.773510]  [<ffffffff8d051a1b>] process_one_work+0x175/0x2ab
<4>[   18.773523]  [<ffffffff8d052c95>] worker_thread+0x1fb/0x2ce
<4>[   18.773535]  [<ffffffff8d052a9a>] ? rescuer_thread+0x2d7/0x2d7
<4>[   18.773548]  [<ffffffff8d056863>] kthread+0x10e/0x116
<4>[   18.773561]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   18.773575]  [<ffffffff8d5f8bac>] ret_from_fork+0x7c/0xb0
<4>[   18.773587]  [<ffffffff8d056755>] ? kthread_stop+0xc0/0xc0
<4>[   18.773597] Code: 00 c0 74 05 80 cc 04 89 c2 b9 01 00 00 00 4c 89 ee 4c 89 e7 41 ff 94 24 d8 00 00 00 48 8b 83 48 07 00 00 41 8b 4c 24 20 4c 89 e7 <8b> 50 48 8b 83 24 04 00 00 41 8b 44 84 30 41 2b 44 24 30 8d b4 
<1>[   18.773724] RIP  [<ffffffff8d345716>] intel_mmio_flip_work_func+0x25f/0x315
<4>[   18.773739]  RSP <ffff8801760b3d88>
<4>[   18.773747] CR2: 0000000000000048
<4>[   18.773756] ---[ end trace 8d614c29c562a82b ]---
<0>[   18.781967] Kernel panic - not syncing: Fatal exception
<0>[   18.782089] Kernel Offset: 0xc000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
<0>[   18.782293] gsmi: Log Shutdown Reason 0x02
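
The oops above appears consistent with the NULL `unpin_work` theory. In the `Code:` line the faulting bytes `<8b> 50 48` decode as `mov 0x48(%rax),%edx`, RAX is 0 in the register dump, and CR2 is 0x48. The preceding bytes `48 8b 83 48 07 00 00` decode as `mov 0x748(%rbx),%rax`, so RAX was loaded from a structure field that held NULL. The sketch below reproduces that arithmetic. The helper names and `struct work_model` are hypothetical, and placing `gtt_offset` at offset 0x48 is an assumption chosen to match CR2; the real structure layout for this kernel build is not shown in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of an x86 ModRM byte. */
struct modrm {
    unsigned mod;   /* addressing mode, 1 means base register + disp8 */
    unsigned reg;   /* register operand, 2 is EDX for a 32-bit mov */
    unsigned rm;    /* base register, 0 is RAX */
};

/* Split a ModRM byte into its three fields. */
struct modrm decode_modrm(uint8_t byte)
{
    struct modrm m = { byte >> 6, (byte >> 3) & 7u, byte & 7u };
    return m;
}

/* Effective address for the "disp8(%base)" form: base plus signed disp8. */
uint64_t effective_address(uint64_t base, int8_t disp8)
{
    return base + (uint64_t)(int64_t)disp8;
}

/*
 * Assumed layout: padding so that gtt_offset sits at 0x48. This only
 * illustrates how a field read through a NULL pointer faults at the
 * field's offset; it is not the real i915 structure.
 */
struct work_model {
    unsigned char earlier_fields[0x48];
    uint32_t gtt_offset;
};
```

With `decode_modrm(0x50)` giving mod 1, reg 2, rm 0, and RAX equal to 0, `effective_address(0, 0x48)` yields 0x48, the same value reported in CR2.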

Comment 1 Dhinakaran Pandiyan 2015-10-19 19:28:19 UTC

Created attachment 118993
dmesg

Comment 2 yann 2016-09-21 15:59:17 UTC

Improvements have been pushed to the kernel and Mesa that will benefit your system and certainly fix your issue, so please re-test with the latest kernel and Mesa to see whether the issue still occurs.
If it does, please also attach the GPU crash dump located at /sys/class/drm/card0/error

Comment 3 Dhinakaran Pandiyan 2016-09-21 18:29:03 UTC

iirc this issue was solved. I will go ahead and close this.

Comment 4 Dongseong Hwang 2017-11-22 03:31:44 UTC

ChromeOS still has this issue. https://bugs.chromium.org/p/chromium/issues/detail?id=776613

Dhinakaran, why did you consider it fixed?

Comment 5 Dhinakaran Pandiyan 2017-11-22 23:12:43 UTC

DS,

I filed that bug over two years ago and as far as I can remember some backports made the GPU hang go away. 

I'd recommend filing a new bug for the issue you are seeing now if it's reproducible on drm-tip and/or talk to someone who's familiar with GPU hangs.

Comment 6 Elizabeth 2017-11-24 22:20:43 UTC

(In reply to Dhinakaran Pandiyan from comment #5)
> DS,
> 
> I filed that bug over two years ago and as far as I can remember some
> backports made the GPU hang go away. 
> 
> I'd recommend filing a new bug for the issue you are seeing now if it's
> reproducible on drm-tip and/or talk to someone who's familiar with GPU hangs.
Hello Dongseong Hwang, please file a new bug for your case, since Dhinakaran stated that his issue was different to https://bugs.chromium.org/p/chromium/issues/detail?id=776613

Comment 7 Dongseong Hwang 2017-11-30 02:36:22 UTC

For the record, it was fixed by 
https://patchwork.freedesktop.org/patch/106110/

Comment 8 Elizabeth 2018-01-05 17:05:53 UTC

For reference:
https://patchwork.freedesktop.org/patch/104303/
commit 3e7d28b655aefefe51f1d7ac6aba46d6ca03b658
Author: Rodrigo Vivi <rodrigo.vivi@intel.com>
Date:   Thu Jan 4 14:45:54 2018 -0800

    drm-tip: 2018y-01m-04d-22h-45m-20s UTC integration manifest
