Bug 102221 - [SKL] GPU HANG on rcs0 on Intel i915 on drm-tip
Summary: [SKL] GPU HANG on rcs0 on Intel i915 on drm-tip
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2017-08-14 17:39 UTC by Igor Zinovyev
Modified: 2018-04-04 10:41 UTC

See Also:
i915 platform: SKL
i915 features: firmware/dmc


Attachments
Dump of dmesg (1.29 MB, text/x-log)
2017-08-14 17:39 UTC, Igor Zinovyev
/sys/class/drm/card0/error (16.11 KB, text/plain)
2017-08-14 17:40 UTC, Igor Zinovyev
error report generated (594.87 KB, text/plain)
2017-10-22 20:44 UTC, Cedric Brandenbourger

Description Igor Zinovyev 2017-08-14 17:39:14 UTC
Created attachment 133505
Dump of dmesg

I got a GPU HANG while the laptop was sitting idle with an external Thunderbolt monitor connected via the USB-C port. This time I was able to SSH into the machine and retrieve the error state. I will try to reproduce it again.

I'm running a kernel that I compiled today from drm-tip. The latest commit in my tree is:

commit 7a620d02bd0a7015fe8f6fc8ae830d47b101394d
Author: Mika Kuoppala <mika.kuoppala@intel.com>
Date:   Mon Aug 14 12:27:48 2017 +0300

    drm-tip: 2017y-08m-14d-09h-26m-51s UTC integration manifest

I'm running Ubuntu on a Dell Precision 5510 with kernel 4.13.0-rc4+ (x86_64). It's an Intel GPU machine with:

00:02.0 VGA compatible controller: Intel Corporation HD Graphics P530 (rev 06)

I recently updated the BIOS to the latest version, 1.2.29, but that doesn't seem to be related: http://www.dell.com/support/home/en/en/rubsdc/drivers/driversdetails?driverId=N1W4N&fileId=3696502611&osCode=WT64A&productCode=precision-m5510-workstation&languageCode=ru&categoryId=BI

I'm attaching the full dmesg and the error state to this bug, but I was not running the machine with drm.debug=0x1e. Please let me know if I need to do that.
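
For readers unfamiliar with the drm.debug=0x1e value mentioned above: it is a bitmask of DRM debug categories. The following is a minimal Python sketch of how such a mask decodes. The bit-to-name table is an assumption based on the kernel's DRM debug categories as commonly documented (check include/drm/drm_print.h for the kernel in use), and decode_drm_debug is a hypothetical helper name, not part of any tool.

```python
# Assumed mapping of drm.debug bits to category names; verify against
# include/drm/drm_print.h for the kernel version you are running.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories whose bits are set in mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0x1E))
```

Under that assumed table, 0x1e selects DRIVER, KMS, PRIME and ATOMIC and leaves CORE and VBL off.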
Comment 1 Igor Zinovyev 2017-08-14 17:40:08 UTC
Created attachment 133506
/sys/class/drm/card0/error
Comment 2 Chris Wilson 2017-08-14 19:51:54 UTC
No, extra drm.debug is not required. It looks like it didn't send the final context-switch interrupt.

Reasonable suspicion falls on:

Aug 14 18:51:41 precision kernel: [ 7196.525920] DC6 already programmed to be enabled.
Aug 14 18:51:41 precision kernel: [ 7196.525947] ------------[ cut here ]------------
Aug 14 18:51:41 precision kernel: [ 7196.525981] WARNING: CPU: 6 PID: 13635 at drivers/gpu/drm/i915/intel_runtime_pm.c:606 skl_enable_dc6+0x9f/0xb0 [i915]
Aug 14 18:51:41 precision kernel: [ 7196.525981] Modules linked in: snd_usb_toneport snd_usb_line6 rfcomm ccm cmac bnep hid_multitouch snd_hda_codec_hdmi nls_iso8859_1 dell_rbtn dell_laptop snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass joydev dell_wmi dell_smbios serio_raw dcdbas wmi_bmof snd_hda_intel snd_hda_codec snd_hda_core snd_usb_audio snd_usbmidi_lib snd_hwdep snd_pcm iwlmvm snd_seq_midi snd_seq_midi_event thunderbolt nvmem_core snd_rawmidi mac80211 rtsx_pci_ms memstick snd_seq uvcvideo videobuf2_vmalloc snd_seq_device videobuf2_memops snd_timer videobuf2_v4l2 input_leds videobuf2_core videodev snd media usblp soundcore iwlwifi mei_me btusb mei idma64 btrtl intel_pch_thermal intel_lpss_pci processor_thermal_device intel_soc_dts_iosf shpchp ie31200_edac
Aug 14 18:51:41 precision kernel: [ 7196.526000]  hci_uart btbcm serdev btqca int3403_thermal btintel bluetooth ecdh_generic intel_lpss_acpi dell_smo8800 intel_lpss int3402_thermal int340x_thermal_zone int3400_thermal mac_hid acpi_pad acpi_thermal_rel intel_hid parport_pc ppdev lp parport efivarfs autofs4 btrfs xor raid6_pq algif_skcipher af_alg dm_crypt dm_mirror dm_region_hash dm_log rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc nouveau i915 aesni_intel aes_x86_64 crypto_simd glue_helper cryptd firewire_ohci psmouse mxm_wmi ttm firewire_core prime_numbers i2c_algo_bit crc_itu_t drm_kms_helper syscopyarea sysfillrect nvme sysimgblt fb_sys_fops nvme_core rtsx_pci drm i2c_hid wmi pinctrl_sunrisepoint pinctrl_intel [last unloaded: snd_usb_line6]
Aug 14 18:51:41 precision kernel: [ 7196.526060] CPU: 6 PID: 13635 Comm: kworker/u16:1 Tainted: G        W       4.13.0-rc4+ #4
Aug 14 18:51:41 precision kernel: [ 7196.526061] Hardware name: Dell Inc. Precision 5510/08R8KJ, BIOS 1.2.29 07/24/2017
Aug 14 18:51:41 precision kernel: [ 7196.526078] Workqueue: i915-dp i915_digport_work_func [i915]
Aug 14 18:51:41 precision kernel: [ 7196.526079] task: ffff8a6a6db55d00 task.stack: ffffb595c72ec000
Aug 14 18:51:41 precision kernel: [ 7196.526089] RIP: 0010:skl_enable_dc6+0x9f/0xb0 [i915]
Aug 14 18:51:41 precision kernel: [ 7196.526090] RSP: 0018:ffffb595c72efd48 EFLAGS: 00010286
Aug 14 18:51:41 precision kernel: [ 7196.526091] RAX: 0000000000000025 RBX: ffff8a6b938c8000 RCX: 0000000000000000
Aug 14 18:51:41 precision kernel: [ 7196.526091] RDX: 0000000000000000 RSI: ffff8a6bbdd8cc38 RDI: ffff8a6bbdd8cc38
Aug 14 18:51:41 precision kernel: [ 7196.526092] RBP: ffffb595c72efd50 R08: 00000000000029f0 R09: 0000000000000004
Aug 14 18:51:41 precision kernel: [ 7196.526092] R10: 0000000000000040 R11: 0000000000000001 R12: ffff8a6b938c8000
Aug 14 18:51:41 precision kernel: [ 7196.526093] R13: ffff8a6b938ccbc0 R14: ffffffffc04613f8 R15: 0000000020000000
Aug 14 18:51:41 precision kernel: [ 7196.526093] FS:  0000000000000000(0000) GS:ffff8a6bbdd80000(0000) knlGS:0000000000000000
Aug 14 18:51:41 precision kernel: [ 7196.526094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 14 18:51:41 precision kernel: [ 7196.526094] CR2: 000000000337b010 CR3: 000000073a60a000 CR4: 00000000003406e0
Aug 14 18:51:41 precision kernel: [ 7196.526095] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Aug 14 18:51:41 precision kernel: [ 7196.526095] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Aug 14 18:51:41 precision kernel: [ 7196.526096] Call Trace:
Aug 14 18:51:41 precision kernel: [ 7196.526106]  gen9_dc_off_power_well_disable+0x24/0x30 [i915]
Aug 14 18:51:41 precision kernel: [ 7196.526135]  intel_power_well_disable+0x39/0x40 [i915]
Aug 14 18:51:41 precision kernel: [ 7196.526144]  intel_display_power_put+0xad/0x110 [i915]
Aug 14 18:51:41 precision kernel: [ 7196.526159]  intel_dp_hpd_pulse+0x15e/0x300 [i915]
Aug 14 18:51:41 precision kernel: [ 7196.526172]  i915_digport_work_func+0x85/0xf0 [i915]
Aug 14 18:51:41 precision kernel: [ 7196.526174]  process_one_work+0x1d6/0x3d0
Aug 14 18:51:41 precision kernel: [ 7196.526175]  worker_thread+0x42/0x3e0
Aug 14 18:51:41 precision kernel: [ 7196.526177]  kthread+0x11f/0x140
Aug 14 18:51:41 precision kernel: [ 7196.526178]  ? trace_event_raw_event_workqueue_execute_start+0xb0/0xb0
Aug 14 18:51:41 precision kernel: [ 7196.526179]  ? kthread_create_on_node+0x60/0x60
Aug 14 18:51:41 precision kernel: [ 7196.526181]  ret_from_fork+0x22/0x30
Aug 14 18:51:41 precision kernel: [ 7196.526182] Code: 05 35 1b 13 00 01 e8 3d 05 56 f8 0f ff eb 99 80 3d 24 1b 13 00 00 75 a7 48 c7 c7 00 11 46 c0 c6 05 14 1b 13 00 01 e8 1d 05 56 f8 <0f> ff eb 90 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 83 bf 40
Aug 14 18:51:41 precision kernel: [ 7196.526196] ---[ end trace db26e1435af3d97b ]---
Aug 14 18:51:41 precision kernel: [ 7196.526476] [drm:gen9_set_dc_state [i915]] *ERROR* DC state mismatch (0x0 -> 0x2)
Aug 14 18:52:41 precision kernel: [ 7255.928570] [drm:gen9_set_dc_state [i915]] *ERROR* DC state mismatch (0x0 -> 0x2)
Aug 14 18:53:37 precision kernel: [ 7312.530599] [drm:gen9_set_dc_state [i915]] *ERROR* DC state mismatch (0x0 -> 0x2)
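
To check another dmesg dump for the same symptom, the repeated "DC state mismatch" errors above can be pulled out with a small script. This is only a sketch: the regular expression is written against the exact line format shown in this log, and find_dc_mismatches is a hypothetical helper name. It returns the kernel timestamp and the two hexadecimal values the driver printed, without interpreting them.

```python
import re

# Matches lines such as:
# ... kernel: [ 7196.526476] [drm:gen9_set_dc_state [i915]] *ERROR* DC state mismatch (0x0 -> 0x2)
MISMATCH_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:gen9_set_dc_state \[i915\]\] \*ERROR\* "
    r"DC state mismatch \((?P<first>0x[0-9a-f]+) -> (?P<second>0x[0-9a-f]+)\)"
)


def find_dc_mismatches(lines):
    """Return (timestamp, first_value, second_value) for each mismatch line."""
    results = []
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            results.append(
                (
                    float(match.group("ts")),
                    int(match.group("first"), 16),
                    int(match.group("second"), 16),
                )
            )
    return results
```

Passing an open dmesg file object to find_dc_mismatches works, since the function only iterates over lines.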
Comment 3 Elizabeth 2017-08-14 20:13:55 UTC
Adding tag into "Whiteboard" field - ReadyForDev
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included
Comment 4 Imre Deak 2017-08-16 12:27:07 UTC
(In reply to Chris Wilson from comment #2)
> No, extra drm.debug is not required. Looks like it didn't send the final
> context-switch interrupt.
> 
> Reasonable suspicion laid on

This looks like a known DMC firmware bug, where toggling the DC6 enabled state can corrupt registers backed by the DC6 power context. There is an internal bug ticket open for this. I'm planning to provide more debug info to the firmware team and convince them to fix it.

One register that can get corrupted is GEN8_MASTER_IRQ, which leaves all i915 interrupts disabled. That would also explain the missing context-switch interrupt.
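
As an illustration of why a corrupted GEN8_MASTER_IRQ would silence every interrupt, here is a minimal Python sketch. It assumes that bit 31 of the register is the master interrupt enable (GEN8_MASTER_IRQ_CONTROL in the i915 headers, as far as I recall); that bit position and the helper name master_irq_enabled are assumptions for illustration, not something stated in this report.

```python
# Assumption: bit 31 of GEN8_MASTER_IRQ is the master interrupt enable bit.
GEN8_MASTER_IRQ_CONTROL = 1 << 31


def master_irq_enabled(reg_value):
    """Return True if the assumed master enable bit is set in reg_value."""
    return bool(reg_value & GEN8_MASTER_IRQ_CONTROL)


# A register value with the enable bit cleared gates every interrupt,
# regardless of which lower status bits happen to be set.
print(master_irq_enabled(0x80000000))
print(master_irq_enabled(0x00000001))
```

If the firmware restores this register with the enable bit cleared, the driver would never see the context-switch interrupt, which fits the hang signature described in comment 2.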

Comment 5 Cedric Brandenbourger 2017-10-22 20:43:48 UTC
I'm getting this error when playing an H.264 video in Totem under Ubuntu 17.10 (kernel 4.13).
Comment 6 Cedric Brandenbourger 2017-10-22 20:44:58 UTC
Created attachment 134997
error report generated
Comment 7 Elizabeth 2017-10-25 16:27:20 UTC
(In reply to Imre Deak from comment #4)
> (In reply to Chris Wilson from comment #2)
> > No, extra drm.debug is not required. Looks like it didn't send the final
> > context-switch interrupt.
> > 
> > Reasonable suspicion laid on
> 
> Looks like a known DMC firmware bug, where toggling DC6 enabled state can
> corrupt registers backed by DC6 power context. There is an internal bug
> ticket opened for this, I'm planning to provide more debug info to the
> firmware team and convince them to fix it.
> 
> One register that can get corrupted is GEN8_MASTER_IRQ leaving all i915
> interrupts disabled, that would also explain the missing ctx switch
> interrupt.
> ...
Hello Imre, is there any relevant update for this bug that can be shared at FDO? Thank you.
Comment 8 Imre Deak 2017-10-26 10:17:08 UTC
(In reply to Elizabeth from comment #7)
> (In reply to Imre Deak from comment #4)
> > (In reply to Chris Wilson from comment #2)
> > > No, extra drm.debug is not required. Looks like it didn't send the final
> > > context-switch interrupt.
> > > 
> > > Reasonable suspicion laid on
> > 
> > Looks like a known DMC firmware bug, where toggling DC6 enabled state can
> > corrupt registers backed by DC6 power context. There is an internal bug
> > ticket opened for this, I'm planning to provide more debug info to the
> > firmware team and convince them to fix it.
> > 
> > One register that can get corrupted is GEN8_MASTER_IRQ leaving all i915
> > interrupts disabled, that would also explain the missing ctx switch
> > interrupt.
> > ...
> Hello Imre, any relevant update for this bug that can be shared at FDO?
> Thank you.

The firmware version with the fix is planned for release next week, so stay tuned :)
Comment 9 Jani Saarinen 2018-03-29 07:10:50 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug so that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), please do that as well to see whether the issue is still present there. If you can no longer see the issue, please comment on the bug.
Comment 10 Igor Zinovyev 2018-03-29 08:15:52 UTC
I'm currently running drm-tip on this commit:

commit d439f4eca05fe48c26bf9c3863e56ee19ac1c50b
Author: Rodrigo Vivi <rodrigo.vivi@intel.com>
Date:   Mon Mar 26 17:17:45 2018 -0700

    drm-tip: 2018y-03m-27d-00h-15m-56s UTC integration manifest

Looking back at a couple of weeks' worth of kernel logs, I cannot find any GPU HANG records anymore. It looks like this can be closed. Thanks!

P.S. I'm not sure whether I should mark it as RESOLVED, though. Please do whatever you feel is necessary.
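
For anyone who wants to repeat the log check described above, a minimal Python sketch follows. It simply collects every line containing the string "GPU HANG". The helper name find_gpu_hangs is hypothetical, and which log files to feed it (for example saved kernel logs or journal exports) depends on the distribution, so that part is left to the reader.

```python
def find_gpu_hangs(lines):
    """Return every log line that mentions a GPU HANG, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if "GPU HANG" in line]


# Usage sketch: pass any iterable of log lines, for example an open file:
#     with open(path_to_kernel_log, errors="replace") as log:
#         hangs = find_gpu_hangs(log)
# An empty result means no hang records were found in that log.
```

Matching on the plain substring keeps the sketch independent of the exact message format, which varies between kernel versions.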
Comment 11 Jani Saarinen 2018-03-29 08:26:30 UTC
I appreciate the feedback, thanks. Resolving. Please re-open if the issue still occurs.

