Summary: | [BAT SKL DMC] GPU death starting with first symptom: WARNING backtrace: "DC6 already programmed to be enabled." | ||
---|---|---|---|
Product: | DRI | Reporter: | Mika Kuoppala <mika.kuoppala> |
Component: | DRM/Intel | Assignee: | Patrik Jakobsson <patrik.r.jakobsson> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | highest | CC: | daniel, gary.c.wang, intel-gfx-bugs, rodrigo.vivi, zinigor+freedesktop |
Version: | XOrg git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | SKL | i915 features: | power/Other |
Description
Mika Kuoppala
2016-01-19 07:53:29 UTC
The dmesg traces in both cases are quite identical, the triggering test being kms_flip --r basic-plain-flip

[ 232.182718] kms_flip: starting subtest basic-plain-flip
[ 233.614716] [drm] RC6 on
[ 244.593971] [drm] RC6 on
[ 252.955029] ------------[ cut here ]------------
[ 252.955049] WARNING: CPU: 7 PID: 6205 at drivers/gpu/drm/i915/intel_runtime_pm.c:578 skl_enable_dc6+0x16a/0x190 [i915]()
[ 252.955050] DC6 already programmed to be enabled.
...

diff --git a/drivers/gpu/drm/i915/intel_runtime_pm.c b/drivers/gpu/drm/i915/inte
index bbca527..a90d4d0 100644
--- a/drivers/gpu/drm/i915/intel_runtime_pm.c
+++ b/drivers/gpu/drm/i915/intel_runtime_pm.c
@@ -497,7 +497,8 @@ static void gen9_set_dc_state(struct drm_i915_private *dev_p
        val &= ~mask;
        val |= state;
        I915_WRITE(DC_STATE_EN, val);
-       POSTING_READ(DC_STATE_EN);
+       if (wait_for((I915_READ(DC_STATE_EN) & mask) == state, 500))
+               DRM_ERROR("Timeout whilst enabling DC6\n");
 }

 void bxt_enable_dc9(struct drm_i915_private *dev_priv)

Just a note: We've also seen instances of this happening with patchwork runs for patches which only change non-skl specific code (i.e. not even generic code). This is definitely real, at least on this specific machine. It would be good to have more skl machines up and running to figure out whether it's just a broken machine or a larger issue with our skl support.

(In reply to Chris Wilson from comment #2)
> diff --git a/drivers/gpu/drm/i915/intel_runtime_pm.c
> b/drivers/gpu/drm/i915/inte
> index bbca527..a90d4d0 100644
> --- a/drivers/gpu/drm/i915/intel_runtime_pm.c
> +++ b/drivers/gpu/drm/i915/intel_runtime_pm.c
> @@ -497,7 +497,8 @@ static void gen9_set_dc_state(struct drm_i915_private
> *dev_p
> val &= ~mask;
> val |= state;
> I915_WRITE(DC_STATE_EN, val);
> - POSTING_READ(DC_STATE_EN);
> + if (wait_for((I915_READ(DC_STATE_EN) & mask) == state, 500))
> + DRM_ERROR("Timeout whilst enabling DC6\n");
> }
>
> void bxt_enable_dc9(struct drm_i915_private *dev_priv)

Indeed something goes wrong around here.
Sometimes when we disable DC6, it stays disabled for only a short amount of time (and/or a few reads) and then pops back up as enabled. That's still a mystery. But what happens next is that we have DC6 on and we try to reset, and apparently resetting while in DC6 doesn't succeed.

It seems we're accessing DC_STATE_EN with PG0 off. The DMC should trap and fix this but apparently fails. Looks like it's not waiting for PG0 to come back up. The result is that we mess up the DMC hw state bits and get unexpected results. Reading DC_STATE_EN with PG0 off gives me all bits as 1's. We could poll until we get sensible data from the register or perhaps just manually power on PG0.

Bug scrub:
----------
Assigned to Patrik

*** Bug 93697 has been marked as a duplicate of this bug. ***

Mika, I believe this got fixed by patches sent by Imre (or do we still have hangs?). Did we pinpoint it to a specific patch? Can this bug be closed?

(In reply to Patrik Jakobsson from comment #8)
> Mika, I believe this got fixed by patches sent by Imre (or do we still have
> hangs?). Did we pinpoint it to a specific patch? Can this bug be closed?

Tried to pinpoint, but I think it was a combination of Imre's rpm fixes with the dmc state hardening ones. We haven't seen this since then.

I seem to be affected by this bug as well, here is the kern.log snippet:

[ 537.187920] ------------[ cut here ]------------
[ 537.187949] WARNING: CPU: 2 PID: 5 at drivers/gpu/drm/i915/intel_runtime_pm.c:686 skl_enable_dc6+0xb5/0
[ 537.187949] DC6 already programmed to be enabled.
[ 537.187950] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat tun bnep i2c_designware_platform i2c_designware_core dell_wmi dell_rbtn dell_laptop dm_crypt snd_hda_codec_hdm p_thermal dell_smbios intel_powerclamp dcdbas coretemp crct10dif_pclmul crc32_pclmul snd_hda_codec_realtek codec_generic aesni_intel aes_x86_64 crypto_simd glue_helper cryptd joydev snd_usb_audio snd_usb_toneport s o_raw snd_usb_line6 snd_hda_codec snd_hwdep snd_seq_midi snd_seq_midi_event snd_hda_core snd_rawmidi snd_se ds uvcvideo videobuf2_vmalloc snd_seq_device iwlmvm videobuf2_memops snd_timer videobuf2_v4l2 usblp videobu
[ 537.187970] videodev mac80211 snd media soundcore rtsx_pci_ms btusb memstick btrtl idma64 iwlwifi mei_ ss_pci processor_thermal_device shpchp intel_soc_dts_iosf hci_uart btbcm btqca btintel int3403_thermal blue i intel_lpss int3402_thermal int3400_thermal int340x_thermal_zone mac_hid acpi_pad acpi_thermal_rel intel_h rqfd irqbypass vfio_iommu_type1 vfio pci_stub parport_pc ppdev lp parport autofs4 rtsx_pci_sdmmc mmc_core n ttm psmouse i2c_algo_bit drm_kms_helper firewire_ohci firewire_core syscopyarea sysfillrect crc_itu_t sysim tsx_pci drm i2c_hid wmi pinctrl_sunrisepoint pinctrl_intel
[ 537.187993] CPU: 2 PID: 5 Comm: kworker/u16:0 Tainted: G W 4.11.3 #36
[ 537.187993] Hardware name: Dell Inc. Precision 5510/08R8KJ, BIOS 1.2.13 08/08/2016
[ 537.188007] Workqueue: i915-dp i915_digport_work_func [i915]
[ 537.188008] Call Trace:
[ 537.188010] dump_stack+0x4d/0x66
[ 537.188012] __warn+0xc6/0xe0
[ 537.188013] warn_slowpath_fmt+0x55/0x80
[ 537.188025] ? fwtable_read32+0x9c/0x1c0 [i915]
[ 537.188034] ? skl_set_power_well+0x143/0x5e0 [i915]
[ 537.188043] skl_enable_dc6+0xb5/0xc0 [i915]
[ 537.188051] gen9_dc_off_power_well_disable+0x2b/0x30 [i915]
[ 537.188059] intel_power_well_disable+0x39/0x40 [i915]
[ 537.188068] intel_display_power_put+0xcf/0x140 [i915]
[ 537.188080] intel_dp_hpd_pulse+0x146/0x2f0 [i915]
[ 537.188092] i915_digport_work_func+0x88/0x100 [i915]
[ 537.188094] process_one_work+0x1ec/0x480
[ 537.188095] worker_thread+0x43/0x4d0
[ 537.188096] kthread+0x103/0x140
[ 537.188097] ? process_one_work+0x480/0x480
[ 537.188098] ? kthread_create_on_node+0x60/0x60
[ 537.188099] ret_from_fork+0x29/0x40
[ 537.188100] ---[ end trace 612d405cb24a9eda ]---

I'm running Linux 4.11.3 on Ubuntu 17.04, I have a Dell Precision m5510 with an Intel GPU on the 915 chipset:

00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics P530 [8086:191d] (rev 06)
        DeviceName: Onboard IGD
        Subsystem: Dell HD Graphics P530 [1028:06e5]
        Kernel driver in use: i915
        Kernel modules: i915
01:00.0 3D controller [0302]: NVIDIA Corporation GM107GLM [Quadro M1000M] [10de:13b1] (rev a2)
        Subsystem: Dell GM107GLM [Quadro M1000M] [1028:06e5]
        Kernel driver in use: nouveau
        Kernel modules: nvidiafb, nouveau

I'm using graphics drivers from the Padoka PPA, the current versions are:

libdrm-intel1:amd64 2.4.81+git1706051541.16444e1~z~padoka0
*mesa* 1:17.2~git170605162900.4b1e6ed~z~padoka0
libwayland* 1.13.0+git201705130017.0eefe99~z~padoka0

Please let me know if you need any more info from me. I'm happy to do a bisect if I can find a way to reproduce that error. Usually it happens when I leave the laptop sitting idle. After that I can't even REISUB, it goes completely unresponsive.

Hi Igor,

Is it possible for you to try to reproduce with drm-tip? (https://cgit.freedesktop.org/drm-tip/log/)

Hi Mika!

Thanks for the reply, sorry to keep you waiting for so long - forgot to sign up for notifications from this thread. Sure, I'd be happy to try.
Do I need to use more verbose debugging or something of that sort?

So I have installed 4.12-rc6 from drm-tip, and this is what the kern.log entry looks like now for me:

Jun 22 17:27:06 precision kernel: [ 8177.720306] DC6 already programmed to be enabled.
Jun 22 17:27:06 precision kernel: [ 8177.720330] ------------[ cut here ]------------
Jun 22 17:27:06 precision kernel: [ 8177.720366] WARNING: CPU: 6 PID: 7082 at drivers/gpu/drm/i915/intel_runtime_pm.c:725 skl_enable_dc6+0x9f/0xb0 [i915]
Jun 22 17:27:06 precision kernel: [ 8177.720366] Modules linked in: uinput ccm rfcomm hid_multitouch cmac bnep i2c_designware_platform i2c_designware_core dell_wmi nls_iso8859_1 snd_hda_codec_hdmi intel_rapl dell_rbtn x86_pkg_temp_thermal dell_laptop intel_powerclamp dell_smbios dcdbas coretemp kvm_intel snd_hda_codec_realtek snd_hda_codec_generic kvm irqbypass iwlmvm snd_usb_toneport snd_usb_line6 joydev mac80211 snd_hda_intel snd_usb_audio serio_raw snd_hda_codec snd_usbmidi_lib snd_hwdep snd_hda_core snd_rawmidi snd_seq snd_pcm snd_seq_device uvcvideo input_leds iwlwifi rtsx_pci_ms snd_timer videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 memstick snd videobuf2_core mei_me videodev soundcore usblp mei media idma64 btusb intel_lpss_pci intel_pch_thermal btrtl processor_thermal_device ie31200_edac shpchp intel_soc_dts_iosf hci_uart
Jun 22 17:27:06 precision kernel: [ 8177.720385] btbcm serdev btqca int3403_thermal btintel bluetooth dell_smo8800 ecdh_generic intel_lpss_acpi intel_lpss int3402_thermal int340x_thermal_zone int3400_thermal acpi_thermal_rel acpi_pad intel_hid mac_hid parport_pc ppdev lp parport efivarfs autofs4 btrfs xor raid6_pq algif_skcipher af_alg dm_crypt dm_mirror dm_region_hash dm_log rtsx_pci_sdmmc mmc_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc i915 aesni_intel nouveau aes_x86_64 crypto_simd glue_helper cryptd mxm_wmi psmouse ttm prime_numbers i2c_algo_bit firewire_ohci drm_kms_helper firewire_core crc_itu_t syscopyarea sysfillrect sysimgblt nvme fb_sys_fops nvme_core rtsx_pci drm i2c_hid wmi pinctrl_sunrisepoint pinctrl_intel
Jun 22 17:27:06 precision kernel: [ 8177.720405] CPU: 6 PID: 7082 Comm: kworker/u16:1 Tainted: G W 4.12.0-rc6+ #1
Jun 22 17:27:06 precision kernel: [ 8177.720406] Hardware name: Dell Inc. Precision 5510/08R8KJ, BIOS 1.2.25 05/07/2017
Jun 22 17:27:06 precision kernel: [ 8177.720424] Workqueue: i915-dp i915_digport_work_func [i915]
Jun 22 17:27:06 precision kernel: [ 8177.720425] task: ffff969ca176c440 task.stack: ffffa81a86888000
Jun 22 17:27:06 precision kernel: [ 8177.720439] RIP: 0010:skl_enable_dc6+0x9f/0xb0 [i915]
Jun 22 17:27:06 precision kernel: [ 8177.720439] RSP: 0018:ffffa81a8688bd50 EFLAGS: 00010282
Jun 22 17:27:06 precision kernel: [ 8177.720440] RAX: 0000000000000025 RBX: ffff969cd4380000 RCX: 0000000000000000
Jun 22 17:27:06 precision kernel: [ 8177.720440] RDX: 0000000000000000 RSI: ffff969cfdd8cc88 RDI: ffff969cfdd8cc88
Jun 22 17:27:06 precision kernel: [ 8177.720441] RBP: ffffa81a8688bd58 R08: 0000000000000001 R09: 00000000000006b4
Jun 22 17:27:06 precision kernel: [ 8177.720441] R10: 0000000000000040 R11: 0000000000000000 R12: ffff969cd4380000
Jun 22 17:27:06 precision kernel: [ 8177.720442] R13: ffffffffc06ccde0 R14: ffff969cd4380000 R15: 0000000020000000
Jun 22 17:27:06 precision kernel: [ 8177.720443] FS: 0000000000000000(0000) GS:ffff969cfdd80000(0000) knlGS:0000000000000000
Jun 22 17:27:06 precision kernel: [ 8177.720443] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 22 17:27:06 precision kernel: [ 8177.720443] CR2: 0000364890f0e000 CR3: 000000076e20a000 CR4: 00000000003406e0
Jun 22 17:27:06 precision kernel: [ 8177.720444] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 22 17:27:06 precision kernel: [ 8177.720444] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Jun 22 17:27:06 precision kernel: [ 8177.720445] Call Trace:
Jun 22 17:27:06 precision kernel: [ 8177.720458] gen9_dc_off_power_well_disable+0x2b/0x30 [i915]
Jun 22 17:27:06 precision kernel: [ 8177.720470] intel_power_well_disable+0x39/0x40 [i915]
Jun 22 17:27:06 precision kernel: [ 8177.720482] intel_display_power_put+0xb5/0x110 [i915]
Jun 22 17:27:06 precision kernel: [ 8177.720499] intel_dp_hpd_pulse+0x229/0x310 [i915]
Jun 22 17:27:06 precision kernel: [ 8177.720515] i915_digport_work_func+0x88/0x100 [i915]
Jun 22 17:27:06 precision kernel: [ 8177.720518] process_one_work+0x1d9/0x3e0
Jun 22 17:27:06 precision kernel: [ 8177.720519] worker_thread+0x43/0x3e0
Jun 22 17:27:06 precision kernel: [ 8177.720520] kthread+0x103/0x140
Jun 22 17:27:06 precision kernel: [ 8177.720521] ? trace_event_raw_event_workqueue_work+0xa0/0xa0
Jun 22 17:27:06 precision kernel: [ 8177.720522] ? kthread_create_on_node+0x60/0x60
Jun 22 17:27:06 precision kernel: [ 8177.720543] ret_from_fork+0x22/0x30
Jun 22 17:27:06 precision kernel: [ 8177.720544] Code: 05 67 c0 15 00 01 e8 c1 76 fb e1 0f ff eb 99 80 3d 56 c0 15 00 00 75 a7 48 c7 c7 48 cb 6c c0 c6 05 46 c0 15 00 01 e8 a1 76 fb e1 <0f> ff eb 90 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 48 83 bf 38
Jun 22 17:27:06 precision kernel: [ 8177.720559] ---[ end trace 6339127b4bfb1449 ]---
Jun 22 17:27:06 precision kernel: [ 8177.720859] [drm:gen9_set_dc_state [i915]] *ERROR* DC state mismatch (0x0 -> 0x2)
Jun 22 17:28:16 precision kernel: [ 8247.122737] [drm:gen9_set_dc_state [i915]] *ERROR* DC state mismatch (0x0 -> 0x2)

The most frustrating thing is that it seems to happen only when I'm away, although I don't have a suspend timeout set. Plus it only seems to happen when I use an external monitor via a Thunderbolt port. Can I provide any more details here?

Igor, you have a different bug, although with a similar symptom about the DC6 state. Please try to reproduce the bug again with a drm-tip kernel and drm.debug=0xe. Can you attach the whole dmesg? I take it that you can log in through ssh?
Also attach the error state if there is one: 'cat /sys/class/drm/card0/error'

closing
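
For readers following the register discussion earlier in this thread: the suggestion to "poll until we get sensible data from the register", combined with the wait_for() change in the proposed diff, can be illustrated with a small user-space sketch. This is only a model under stated assumptions, not the i915 code: `poll_dc_state`, `reg_read_fn`, `struct fake_reg` and `fake_read` are hypothetical names introduced here, the loop is bounded by a read count rather than the 500 ms timeout used by wait_for(), and an all-ones value is taken to mean "PG0 is off" only because that is what the thread reports for DC_STATE_EN.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for I915_READ(DC_STATE_EN); not a kernel API. */
typedef uint32_t (*reg_read_fn)(void *ctx);

enum poll_result { POLL_OK = 0, POLL_TIMEOUT = -1 };

/*
 * Re-read the register until (value & mask) == state.
 * An all-ones read is skipped, on the assumption (taken from the thread)
 * that it means the power well is off and the data is not meaningful.
 * The bound is a number of reads, not a time-based timeout.
 */
static int poll_dc_state(reg_read_fn read_reg, void *ctx,
                         uint32_t mask, uint32_t state,
                         unsigned int max_reads)
{
    assert(read_reg != NULL);

    for (unsigned int i = 0; i < max_reads; i++) {
        uint32_t val = read_reg(ctx);

        if (val == UINT32_MAX)
            continue;           /* garbage read, try again */
        if ((val & mask) == state)
            return POLL_OK;     /* hardware reports the requested state */
    }
    return POLL_TIMEOUT;
}

/* Scripted fake register used to exercise the poll loop. */
struct fake_reg {
    const uint32_t *values;     /* values returned by successive reads */
    size_t count;               /* number of scripted values (>= 1) */
    size_t pos;                 /* index of the next value to return */
};

/* Returns the next scripted value; the last value repeats forever. */
static uint32_t fake_read(void *ctx)
{
    struct fake_reg *reg = ctx;
    uint32_t val = reg->values[reg->pos];

    if (reg->pos + 1 < reg->count)
        reg->pos++;
    return val;
}
```

With a scripted sequence such as all-ones, all-ones, 0x0, 0x2 and an illustrative mask of 0x3, the helper skips the garbage reads and reports success once the register settles on 0x2; a register stuck at all-ones ends in POLL_TIMEOUT, which corresponds to the "Timeout whilst enabling DC6" branch in the diff.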