Summary: | [bdw-u iommu] DMAR error -> GPU hang | ||
---|---|---|---|
Product: | DRI | Reporter: | Stefan Junker <code> |
Component: | DRM/Intel | Assignee: | Joonas Lahtinen <joonas.lahtinen> |
Status: | RESOLVED MOVED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | critical | ||
Priority: | high | CC: | ad_sicks, anarsoul, baptiste, bl, bobbypowers, bordjukov, bugger, bugs.fdo, bugs, chuanxiao.dong, corsac, damir.banic, dastergon, doa379, dwmw2, dyadkin, etn45p4m, freedesktop.org, gary.c.wang, geromanas, giovanni.grc96, hiroru, humberto.i.perez.rodriguez, intel-gfx-bugs, Jan.Henke, jethompson81, joon.jung, jp.pozzi, klondike, lakshminarayana.vudum, leho, linux, lskrejci, madhavigelli, mads, marcin.marcin.m, maria.g.perez.ibarra, marius.c.vlad, matthias.h.nagel, milosvova, niko.pavlinek, nullity.research, pachoramos1, pmenzel+bugs.freedesktop.org, prochazka.nicolas, ragadinks, rdragone+bugzilla, sami.liedes, siflfran, stsp2, tod.jackson, udo, ungift-ed, yunying.sun |
Version: | XOrg git | ||
Hardware: | All | ||
OS: | Linux (All) | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | BDW | i915 features: | GEM/Other, GPU hang |
Attachments: |
Short version --- adding 'intel_iommu=igfx_off' helped.
Long version --- I've tried many things to resolve this issue, from kernel reconfiguration to installing mesa, libdrm and the Intel drivers from the latest repository masters, none of which helped. I reverted to the most recent releases of the packages. From an older forum entry somewhere on the web I found that this could be related to virtualization technologies and memory remapping, so I added the following argument to my kernel command line: 'intel_iommu=igfx_off'. Ever since (about 10-15 hours of very active usage) I haven't had a single freeze. I still think this is not normal behavior, since turning off the IOMMU for the GPU can't be the right or necessary thing to do.
Created attachment 113998 [details]
bpowers' dmesg
Created attachment 113999 [details]
bpowers' GPU crash dump
I'm also seeing this, on an i7-5600U. Kernel 4.0.0-rc2+, xf86-video-intel from a few days ago (9fb8154), mesa 10.5-rc3. I turned off VT-d in the BIOS, and haven't seen any issues since. I've attached both my dmesg and the GPU crash dump from one of the times this happened.
Some additional info: my BIOS (3rd gen X1 Carbon) apparently marks x2apic as broken. I booted a number of times with intremap=no_x2apic_optout on the kernel command line, and saw what steveej mentioned: a hard freeze. The system did have the foresight to save the dmesg into the EFI pstore. I have those logs if they are useful. After removing no_x2apic_optout, the kernel "Enabled IRQ remapping in xapic mode", and under xapic the kernel was sometimes able to recover/reset the chip to an OK-enough state that I could save dmesg and grab the GPU dump from /sys/class/drm/card0/error.
*** Bug 90823 has been marked as a duplicate of this bug. ***
*** Bug 90091 has been marked as a duplicate of this bug. ***
*** Bug 91076 has been marked as a duplicate of this bug. ***
In bug 91076, the hangs are anything but random. Simply initialising execlists is enough for the GPU to die.
*** Bug 91152 has been marked as a duplicate of this bug. ***
Quick experiment from bug 91152 indicates that this is a problem with or without execlists enabled.
*** Bug 91458 has been marked as a duplicate of this bug. ***
I've noticed the same problem here. I also get frequent DMAR errors without hangs during Ubuntu's fade-to-black animation for DPMS off.
I found the same problem as in comment #4. If I disable VT-d in the BIOS the crashes disappear. But then I get random segmentation faults from GCC if I try to compile QtWebKit. (N.B.: I run Gentoo and compile all packages myself.) Hence, I have two options:
(1) Disable VT-d for daily work so that i915 does not crash
(2) Enable VT-d and only boot into text console mode if I need to compile QtWebKit
My hardware is a Lenovo ThinkPad X1 Carbon (3rd generation, type 20BT)
Processor: Intel Core i5-5200U
Memory: 8 GB PC3-12800L
BIOS: Phoenix, ver. 1.09, 7/22/2015
Interestingly, MemTest86 does not report any memory error in either case (with and without VT-d).
Sounds more like an actual corruption prevented by DMAR then.
*** Bug 91633 has been marked as a duplicate of this bug. ***
*** Bug 91764 has been marked as a duplicate of this bug. ***
I'm not sure if 91764 really is a duplicate of this bug. intel_iommu=igfx_off doesn't help in my case.
In my case this issue (googling for the opcode hanging the GPU led me to this bug) was solved by disabling the EFI framebuffer in the kernel configuration. If the devs want, I can open a second bug to request that the Intel GFX drivers take over from early framebuffers (for example EFI or VGA) to prevent my particular issue.
Looking at the first two dumps, this looks like it might be a simple driver bug. The driver forgets to use the DMA API and wrongly just hands a physical address to the device. The device does DMA to that invalid address, takes a well-deserved fault, and is subsequently unhappy. The faulting addresses do not look like addresses which would be given out as virtual DMA addresses by the DMA API. Such addresses would typically start at 0xfffff000 and grow downwards.
Created attachment 118605 [details] kernel log on 4.2.2
I reported bug #90091 initially, which was marked as a duplicate of this one. I've tried to reproduce with 4.2.2, and it still happens. Two dmesgs are attached, the second one with i915.enable_execlists=0. In the latter case I only have the DMAR fault but not the GPU hang.
Created attachment 118606 [details]
kernel log on 4.2.2 with execlists disabled
Second log, with execlists disabled. Actually in the meantime I managed to get a GPU hang.
Offending userland code looks like chromium, which I guess uses the DDX intensively.
*** Bug 92531 has been marked as a duplicate of this bug. ***
I have the same problem, and Google Chrome (as in the binary release, not Chromium) seems to trigger it. Disabling 3D rendering in X has helped, as now I get a crash and hang every couple of weeks instead of around once a day. Here's my uname output, in case it helps:
Linux laurana 4.0.5-gentoo #5 SMP Tue Sep 22 09:45:32 CEST 2015 x86_64 Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz GenuineIntel GNU/Linux
I'll try the 'intel_iommu=igfx_off' kernel option and see if that improves matters. Failing that, I'll disable VT-d, unless it is required by Docker. If I can provide any helpful info, please let me know and I'll be happy to do so.
*** Bug 92905 has been marked as a duplicate of this bug. ***
Is there anything missing from us users about this? What can we do to push this forward?
*** Bug 94229 has been marked as a duplicate of this bug. ***
Created attachment 122465 [details]
Resetting chip after gpu hang
I have a Dell XPS 13 9350 (2016, Intel Core i7-6560U) running Fedora 23 (GNOME), currently on kernel 4.4.5-300.fc23.x86_64.
As soon as I open Chrome or Firefox and, say, open YouTube, after a few seconds the whole laptop hangs and becomes totally unresponsive. I need to hold the power button for a few seconds to power off the device.
Once I was able to get a log, and it's attached.
I added "intel_iommu=igfx_off" to /etc/default/grub and regenerated GRUB ( grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg) but it does not solve the problem here.
If any more info is needed I would be glad to provide them to you. I hope I will be able to use my new Notebook without issue.
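Since several reports here depend on whether the option really reached the running kernel after regenerating GRUB, here is a minimal Python sketch for checking it. It assumes the kernel command line is readable from /proc/cmdline; the function names are made up for this illustration and are not part of any tool mentioned in this bug.

```python
def intel_iommu_options(cmdline):
    """Collect the comma-separated values of every intel_iommu= argument."""
    options = []
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if key == "intel_iommu" and sep:
            options.extend(v for v in value.split(",") if v)
    return options


def igfx_off_requested(cmdline):
    """True if the command line asks for igfx_off (hypothetical helper)."""
    return "igfx_off" in intel_iommu_options(cmdline)


if __name__ == "__main__":
    # Assumption: running on Linux, where /proc/cmdline holds the boot arguments.
    with open("/proc/cmdline") as f:
        running = f.read()
    print("intel_iommu options:", intel_iommu_options(running))
    print("igfx_off requested:", igfx_off_requested(running))
```

If the printed list is empty after a reboot, the edit to /etc/default/grub probably did not make it into the generated grub.cfg. This only shows what was requested on the command line, not what the IOMMU driver actually did with it.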
Comment on attachment 122465 [details] Resetting chip after gpu hang >[Mon Mar 21 18:02:39 2016] i915 0000:00:02.0: Invalid ROM contents >[Mon Mar 21 18:04:52 2016] nf_conntrack: automatic helper assignment is deprecated and it will be removed soon. Use the iptables CT target to attach helpers instead. >[Mon Mar 21 18:05:54 2016] [drm] stuck on render ring >[Mon Mar 21 18:05:54 2016] [drm] GPU HANG: ecode 9:0:0x85dfbfff, in chrome [3398], reason: Ring hung, action: reset >[Mon Mar 21 18:05:54 2016] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace. >[Mon Mar 21 18:05:54 2016] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel >[Mon Mar 21 18:05:54 2016] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue. >[Mon Mar 21 18:05:54 2016] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it. >[Mon Mar 21 18:05:54 2016] [drm] GPU crash dump saved to /sys/class/drm/card0/error >[Mon Mar 21 18:05:54 2016] ------------[ cut here ]------------ >[Mon Mar 21 18:05:54 2016] WARNING: CPU: 1 PID: 423 at drivers/gpu/drm/i915/intel_display.c:11289 intel_mmio_flip_work_func+0x387/0x3d0 [i915]() >[Mon Mar 21 18:05:54 2016] WARN_ON(__i915_wait_request(mmio_flip->req, mmio_flip->crtc->reset_counter, false, NULL, &mmio_flip->i915->rps.mmioflips)) >[Mon Mar 21 18:05:54 2016] Modules linked in: >[Mon Mar 21 18:05:54 2016] rfcomm fuse cmac xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_filter ebtable_nat ebtables ip6table_security ip6table_mangle ip6table_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables iptable_security iptable_mangle iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bnep hid_multitouch 
snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal coretemp snd_soc_skl dell_led snd_soc_skl_ipc snd_hda_ext_core kvm_intel snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_core kvm snd_hda_codec_realtek vfat snd_hda_codec_generic fat snd_compress snd_pcm_dmaengine ac97_bus >[Mon Mar 21 18:05:54 2016] dell_wmi sparse_keymap dw_dmac_core i2c_designware_platform dell_laptop i2c_designware_core snd_hda_intel irqbypass dcdbas brcmfmac snd_hda_codec brcmutil snd_hda_core snd_hwdep cfg80211 snd_seq snd_seq_device uvcvideo rtsx_pci_ms snd_pcm videobuf2_vmalloc memstick videobuf2_memops videobuf2_v4l2 videobuf2_core v4l2_common joydev snd_timer videodev snd mei_me btusb i2c_i801 soundcore btrtl idma64 media mei shpchp intel_lpss_pci hci_uart wmi btbcm btqca btintel bluetooth pinctrl_sunrisepoint pinctrl_intel rfkill intel_lpss_acpi intel_lpss processor_thermal_device int3403_thermal int340x_thermal_zone int3400_thermal intel_soc_dts_iosf acpi_als iosf_mbi acpi_thermal_rel kfifo_buf acpi_pad industrialio tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt i915 rtsx_pci_sdmmc mmc_core >[Mon Mar 21 18:05:54 2016] crct10dif_pclmul crc32_pclmul crc32c_intel i2c_algo_bit drm_kms_helper drm serio_raw nvme rtsx_pci i2c_hid video fjes >[Mon Mar 21 18:05:54 2016] CPU: 1 PID: 423 Comm: kworker/1:3 Not tainted 4.4.5-300.fc23.x86_64 #1 >[Mon Mar 21 18:05:54 2016] Hardware name: Dell Inc. 
XPS 13 9350/XXXXX, BIOS 1.2.3 01/08/2016 >[Mon Mar 21 18:05:54 2016] Workqueue: events intel_mmio_flip_work_func [i915] >[Mon Mar 21 18:05:54 2016] 0000000000000286 0000000093832e83 ffff880272cb7d20 ffffffff813b54ae >[Mon Mar 21 18:05:54 2016] ffff880272cb7d68 ffffffffa01f9de8 ffff880272cb7d58 ffffffff810a40f2 >[Mon Mar 21 18:05:54 2016] ffff880215ebf0c0 ffff880280c96600 ffff880280c9b000 0000000000000040 >[Mon Mar 21 18:05:54 2016] Call Trace: >[Mon Mar 21 18:05:54 2016] [<ffffffff813b54ae>] dump_stack+0x63/0x85 >[Mon Mar 21 18:05:54 2016] [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0 >[Mon Mar 21 18:05:54 2016] [<ffffffff810a418c>] warn_slowpath_fmt+0x5c/0x80 >[Mon Mar 21 18:05:54 2016] [<ffffffffa01937d7>] intel_mmio_flip_work_func+0x387/0x3d0 [i915] >[Mon Mar 21 18:05:54 2016] [<ffffffff810bc596>] process_one_work+0x156/0x430 >[Mon Mar 21 18:05:54 2016] [<ffffffff810bc8be>] worker_thread+0x4e/0x450 >[Mon Mar 21 18:05:54 2016] [<ffffffff8179bd55>] ? __schedule+0x3a5/0xa00 >[Mon Mar 21 18:05:54 2016] [<ffffffff810bc870>] ? process_one_work+0x430/0x430 >[Mon Mar 21 18:05:54 2016] [<ffffffff810bc870>] ? process_one_work+0x430/0x430 >[Mon Mar 21 18:05:54 2016] [<ffffffff810c2648>] kthread+0xd8/0xf0 >[Mon Mar 21 18:05:54 2016] [<ffffffff810c2570>] ? kthread_worker_fn+0x160/0x160 >[Mon Mar 21 18:05:54 2016] [<ffffffff817a088f>] ret_from_fork+0x3f/0x70 >[Mon Mar 21 18:05:54 2016] [<ffffffff810c2570>] ? kthread_worker_fn+0x160/0x160 >[Mon Mar 21 18:05:54 2016] ---[ end trace b5a5acfc195b296b ]--- >[Mon Mar 21 18:05:54 2016] drm/i915: Resetting chip after gpu hang >[Mon Mar 21 18:05:56 2016] [drm] RC6 on [Mon Mar 21 21:23:31 2016] [drm] stuck on render ring [Mon Mar 21 21:23:31 2016] [drm] GPU HANG: ecode 9:0:0x85df9fff, in chrome [5196], reason: Ring hung, action: reset [Mon Mar 21 21:23:31 2016] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace. 
[Mon Mar 21 21:23:31 2016] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel [Mon Mar 21 21:23:31 2016] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue. [Mon Mar 21 21:23:31 2016] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it. [Mon Mar 21 21:23:31 2016] [drm] GPU crash dump saved to /sys/class/drm/card0/error [Mon Mar 21 21:23:31 2016] ------------[ cut here ]------------ [Mon Mar 21 21:23:31 2016] WARNING: CPU: 2 PID: 120 at drivers/gpu/drm/i915/intel_display.c:11289 intel_mmio_flip_work_func+0x387/0x3d0 [i915]() [Mon Mar 21 21:23:31 2016] WARN_ON(__i915_wait_request(mmio_flip->req, mmio_flip->crtc->reset_counter, false, NULL, &mmio_flip->i915->rps.mmioflips)) [Mon Mar 21 21:23:31 2016] Modules linked in: [Mon Mar 21 21:23:31 2016] rfcomm fuse cmac xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_conntrack ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables ip6_tables iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle bnep hid_multitouch intel_rapl snd_soc_skl x86_pkg_temp_thermal snd_soc_skl_ipc coretemp snd_hda_ext_core snd_soc_sst_ipc snd_hda_codec_hdmi kvm_intel snd_soc_sst_dsp dell_led kvm snd_soc_core brcmfmac vfat snd_hda_codec_realtek fat snd_compress snd_hda_codec_generic snd_pcm_dmaengine ac97_bus dw_dmac_core brcmutil dell_wmi dell_laptop i2c_designware_platform snd_hda_intel i2c_designware_core sparse_keymap dcdbas cfg80211 irqbypass snd_hda_codec [Mon Mar 21 21:23:31 2016] snd_hda_core snd_hwdep snd_seq uvcvideo rtsx_pci_ms videobuf2_vmalloc memstick videobuf2_memops videobuf2_v4l2 snd_seq_device snd_pcm joydev btusb videobuf2_core btrtl v4l2_common videodev snd_timer snd mei_me media i2c_i801 soundcore mei hci_uart idma64 processor_thermal_device shpchp intel_lpss_pci 
intel_soc_dts_iosf iosf_mbi wmi btbcm btqca btintel bluetooth pinctrl_sunrisepoint pinctrl_intel rfkill intel_lpss_acpi int3403_thermal intel_lpss int340x_thermal_zone int3400_thermal acpi_thermal_rel acpi_pad acpi_als kfifo_buf industrialio tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt i915 rtsx_pci_sdmmc mmc_core i2c_algo_bit drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel drm nvme serio_raw rtsx_pci i2c_hid video fjes [Mon Mar 21 21:23:31 2016] CPU: 2 PID: 120 Comm: kworker/2:1 Not tainted 4.4.5-300.fc23.x86_64 #1 [Mon Mar 21 21:23:31 2016] Hardware name: Dell Inc. XPS 13 9350/0JXC1H, BIOS 1.2.3 01/08/2016 [Mon Mar 21 21:23:31 2016] Workqueue: events intel_mmio_flip_work_func [i915] [Mon Mar 21 21:23:31 2016] 0000000000000286 000000005e30db05 ffff880273f0fd20 ffffffff813b54ae [Mon Mar 21 21:23:31 2016] ffff880273f0fd68 ffffffffa01f4de8 ffff880273f0fd58 ffffffff810a40f2 [Mon Mar 21 21:23:31 2016] ffff8802318a8140 ffff880280d16600 ffff880280d1b000 0000000000000080 [Mon Mar 21 21:23:31 2016] Call Trace: [Mon Mar 21 21:23:31 2016] [<ffffffff813b54ae>] dump_stack+0x63/0x85 [Mon Mar 21 21:23:31 2016] [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0 [Mon Mar 21 21:23:31 2016] [<ffffffff810a418c>] warn_slowpath_fmt+0x5c/0x80 [Mon Mar 21 21:23:31 2016] [<ffffffffa018e7d7>] intel_mmio_flip_work_func+0x387/0x3d0 [i915] [Mon Mar 21 21:23:31 2016] [<ffffffff810bc596>] process_one_work+0x156/0x430 [Mon Mar 21 21:23:31 2016] [<ffffffff810bc8be>] worker_thread+0x4e/0x450 [Mon Mar 21 21:23:31 2016] [<ffffffff8179bd55>] ? __schedule+0x3a5/0xa00 [Mon Mar 21 21:23:31 2016] [<ffffffff810bc870>] ? process_one_work+0x430/0x430 [Mon Mar 21 21:23:31 2016] [<ffffffff810bc870>] ? process_one_work+0x430/0x430 [Mon Mar 21 21:23:31 2016] [<ffffffff810c2648>] kthread+0xd8/0xf0 [Mon Mar 21 21:23:31 2016] [<ffffffff810c2570>] ? 
kthread_worker_fn+0x160/0x160 [Mon Mar 21 21:23:31 2016] [<ffffffff817a088f>] ret_from_fork+0x3f/0x70 [Mon Mar 21 21:23:31 2016] [<ffffffff810c2570>] ? kthread_worker_fn+0x160/0x160 [Mon Mar 21 21:23:31 2016] ---[ end trace 2e1339ae2448560b ]--- [Mon Mar 21 21:23:31 2016] drm/i915: Resetting chip after gpu hang [Mon Mar 21 21:23:33 2016] [drm] RC6 on [Mon Mar 21 21:23:47 2016] [drm] stuck on render ring [Mon Mar 21 21:23:47 2016] [drm] GPU HANG: ecode 9:0:0x85dfbfff, in chrome [5273], reason: Ring hung, action: reset [Mon Mar 21 21:23:47 2016] ------------[ cut here ]------------ [Mon Mar 21 21:23:47 2016] WARNING: CPU: 2 PID: 120 at drivers/gpu/drm/i915/intel_display.c:11289 intel_mmio_flip_work_func+0x387/0x3d0 [i915]() [Mon Mar 21 21:23:47 2016] WARN_ON(__i915_wait_request(mmio_flip->req, mmio_flip->crtc->reset_counter, false, NULL, &mmio_flip->i915->rps.mmioflips)) [Mon Mar 21 21:23:47 2016] Modules linked in: [Mon Mar 21 21:23:47 2016] rfcomm fuse cmac xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_conntrack ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables ip6_tables iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle bnep hid_multitouch intel_rapl snd_soc_skl x86_pkg_temp_thermal snd_soc_skl_ipc coretemp snd_hda_ext_core snd_soc_sst_ipc snd_hda_codec_hdmi kvm_intel snd_soc_sst_dsp dell_led kvm snd_soc_core brcmfmac vfat snd_hda_codec_realtek fat snd_compress snd_hda_codec_generic snd_pcm_dmaengine ac97_bus dw_dmac_core brcmutil dell_wmi dell_laptop i2c_designware_platform snd_hda_intel i2c_designware_core sparse_keymap dcdbas cfg80211 irqbypass snd_hda_codec [Mon Mar 21 21:23:47 2016] snd_hda_core snd_hwdep snd_seq uvcvideo rtsx_pci_ms videobuf2_vmalloc memstick videobuf2_memops videobuf2_v4l2 snd_seq_device snd_pcm joydev btusb videobuf2_core btrtl v4l2_common videodev snd_timer snd mei_me media 
i2c_i801 soundcore mei hci_uart idma64 processor_thermal_device shpchp intel_lpss_pci intel_soc_dts_iosf iosf_mbi wmi btbcm btqca btintel bluetooth pinctrl_sunrisepoint pinctrl_intel rfkill intel_lpss_acpi int3403_thermal intel_lpss int340x_thermal_zone int3400_thermal acpi_thermal_rel acpi_pad acpi_als kfifo_buf industrialio tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt i915 rtsx_pci_sdmmc mmc_core i2c_algo_bit drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel drm nvme serio_raw rtsx_pci i2c_hid video fjes [Mon Mar 21 21:23:47 2016] CPU: 2 PID: 120 Comm: kworker/2:1 Tainted: G W 4.4.5-300.fc23.x86_64 #1 [Mon Mar 21 21:23:47 2016] Hardware name: Dell Inc. XPS 13 9350/0JXC1H, BIOS 1.2.3 01/08/2016 [Mon Mar 21 21:23:47 2016] Workqueue: events intel_mmio_flip_work_func [i915] [Mon Mar 21 21:23:47 2016] 0000000000000286 000000005e30db05 ffff880273f0fd20 ffffffff813b54ae [Mon Mar 21 21:23:47 2016] ffff880273f0fd68 ffffffffa01f4de8 ffff880273f0fd58 ffffffff810a40f2 [Mon Mar 21 21:23:47 2016] ffff880066669440 ffff880280d16600 ffff880280d1b000 0000000000000080 [Mon Mar 21 21:23:47 2016] Call Trace: [Mon Mar 21 21:23:47 2016] [<ffffffff813b54ae>] dump_stack+0x63/0x85 [Mon Mar 21 21:23:47 2016] [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0 [Mon Mar 21 21:23:47 2016] [<ffffffff810a418c>] warn_slowpath_fmt+0x5c/0x80 [Mon Mar 21 21:23:47 2016] [<ffffffffa018e7d7>] intel_mmio_flip_work_func+0x387/0x3d0 [i915] [Mon Mar 21 21:23:47 2016] [<ffffffff810bc596>] process_one_work+0x156/0x430 [Mon Mar 21 21:23:47 2016] [<ffffffff810bc8be>] worker_thread+0x4e/0x450 [Mon Mar 21 21:23:47 2016] [<ffffffff8179bd55>] ? __schedule+0x3a5/0xa00 [Mon Mar 21 21:23:47 2016] [<ffffffff810bc870>] ? process_one_work+0x430/0x430 [Mon Mar 21 21:23:47 2016] [<ffffffff810bc870>] ? process_one_work+0x430/0x430 [Mon Mar 21 21:23:47 2016] [<ffffffff810c2648>] kthread+0xd8/0xf0 [Mon Mar 21 21:23:47 2016] [<ffffffff810c2570>] ? 
kthread_worker_fn+0x160/0x160 [Mon Mar 21 21:23:47 2016] [<ffffffff817a088f>] ret_from_fork+0x3f/0x70 [Mon Mar 21 21:23:47 2016] [<ffffffff810c2570>] ? kthread_worker_fn+0x160/0x160 [Mon Mar 21 21:23:47 2016] ---[ end trace 2e1339ae2448560c ]--- [Mon Mar 21 21:23:47 2016] drm/i915: Resetting chip after gpu hang [Mon Mar 21 21:23:49 2016] [drm] RC6 on Can someone delete my last comment (and also this), I was going to add attachment and not comment the whole calltrace log. Created attachment 122775 [details]
Kernel log with i915 hang
Created attachment 122776 [details]
Error from intel card
I have a Dell Latitude e7250 with:
- Intel(R) Core(TM) i7-5600U CPU
- Intel Corporation Broadwell-U Integrated Graphics (rev 09)
I see the same behavior here. I've tried disabling other framebuffer devices such as EFI, but so far the only thing that has worked for me is disabling the iommu in the i915 kernel module, which is not a clean approach. I've attached my dumps and hope they help. I'll try Arch Linux anyway to see if it has the same problem.
Sergi
*** Bug 94959 has been marked as a duplicate of this bug. ***
It is still unfixed as of now, more than one year after it was reported. It is crucial: it makes the laptop unusable for ANY kind of video task, as with igfx_off it works really slowly. Will such a critical bug finally be assigned to someone? It was reported so many times and affected so many users... Or at least any thoughts, like where to look in this damn DMAR code for this bug?
I have this too, on an i5-5287U. The iommu is disabled by default on Linux. For better KVM virtualization, it can be turned on using the kernel command line parameter intel_iommu=on. With this kernel parameter present, Linux freezes during boot.
This is actually fixed for me in 4.7. I have a Thinkpad X1 Carbon with an Intel i7-5600U, and haven't seen any issues when enabling the IOMMU over the last month or so (I don't remember if things also worked under 4.6, but I was running 4.7 RCs for a bit). Hope that helps.
I built and tried 4.7.2 last night and enabling VT-d still causes the freezes to happen, so I don't think it is fixed yet.
Kernel log messages when the freezes happen contain the following when it was able to recover from the freeze: [ 1936.694513] [drm] stuck on render ring [ 1936.694899] [drm] GPU HANG: ecode 8:0:0x85dffffb, in X [3356], reason: Engine(s) hung, action: reset [ 1936.696494] drm/i915: Resetting chip after gpu hang And the last one that killed it, requiring the power button to be held to turn the machine off, had a different ecode: [ 1944.706379] [drm] stuck on render ring [ 1944.706694] [drm] GPU HANG: ecode 8:0:0xbf9fffff, reason: Engine(s) hung, action: reset [ 1944.708378] drm/i915: Resetting chip after gpu hang Created attachment 127048 [details]
drm_error crash dump of brots ThinkPad X250
Hello everyone,
First of all, sorry, I forgot to add the comment to the file. I think I am also hitting this bug. The system is a Lenovo ThinkPad X250 with an Intel(R) Core(TM) i5-5200U and HD Graphics 5500. Since I started using the IOMMU I have been getting hangs and also some X restarts. This is what dmesg says after the fact:
[ 1409.513438] DMAR: DRHD: handling fault status reg 3
[ 1409.513451] DMAR: [DMA Read] Request device [00:02.0] fault addr f5995000 [fault reason 05] PTE Write access is not set
[ 1409.513468] DMAR: DRHD: handling fault status reg 3
[ 1409.513471] DMAR: [DMA Write] Request device [00:02.0] fault addr f5968000 [fault reason 23] Unknown
[ 1418.830396] [drm] GPU HANG: ecode 8:0:0x85dffffb, in kwin_x11 [2191], reason: Hang on render ring, action: reset
[ 1418.830407] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1418.830409] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1418.830410] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1418.830412] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1418.830413] [drm] GPU crash dump saved to /sys/class/drm/card0/error
I will upload said crash dump here. If I can provide you with any more debug logs or should test something, I would be glad to help.
Thanks, Michael
*** Bug 98309 has been marked as a duplicate of this bug. ***
(In reply to yann from comment #42)
> *** Bug 98309 has been marked as a duplicate of this bug. ***
I would be careful not to mix gen8/gen9 reports for the moment, not until we have the root cause.
(In reply to Chris Wilson from comment #43)
> (In reply to yann from comment #42)
> > *** Bug 98309 has been marked as a duplicate of this bug. ***
>
> I would be careful not to mix gen8/gen9 reports for the moment, not until we
> have the root cause.
Well I pointed out early that one of the causes seems to be a conflict between the UEFI framebuffer driver and the intel one, most likely because of some race conditions or both trying to access the same hardware at the same time. For me disabling the EFI framebuffer solved the issue so far so maybe other reporters may want to test and see if that solves the issue for them too. (In reply to klondike from comment #44) > > For me disabling the EFI framebuffer solved the issue so far so maybe other > reporters may want to test and see if that solves the issue for them too. I've neither had efifb nor fbsimple enabled on my XPS 15 9550, but I didn't get rid of this problem until I added "intel_iommu=igfx_off" to my bootargs. (In reply to Mads from comment #45) > (In reply to klondike from comment #44) > > > > For me disabling the EFI framebuffer solved the issue so far so maybe other > > reporters may want to test and see if that solves the issue for them too. > > I've neither had efifb nor fbsimple enabled on my XPS 15 9550, but I didn't > get rid of this problem until I added "intel_iommu=igfx_off" to my bootargs. Interesting, it seems then that there is more than one different instance of this bug then. Do you have any other FB or driver that interacts with the intel card other than the Intel's modesetting one? The VGA console could be one such driver. (In reply to Mads from comment #45) > (In reply to klondike from comment #44) > > > > For me disabling the EFI framebuffer solved the issue so far so maybe other > > reporters may want to test and see if that solves the issue for them too. > > I've neither had efifb nor fbsimple enabled on my XPS 15 9550, but I didn't > get rid of this problem until I added "intel_iommu=igfx_off" to my bootargs. I failed to mention that I didn't come across this bug until trying some drm-intel-nightly based on 4.9.0, but efifb was never involved. 
intel_iommu=igfx_off also solved a long-standing bug with suspend on my machine, bug 97211.
VGA_CONSOLE is built in. But so is DRM_I915, with firmware, so this appears in dmesg:
...
[ 1.051229] [drm] Replacing VGA console driver
...
But I don't know if what I'm experiencing is relevant here, since I didn't see this bug appear until just recently with drm-intel-nightly (and it was fixed again by using intel_iommu=igfx_off).
I guess we are both experiencing different bugs with similar symptoms. In my case at least, the UEFI display driver clashes with the Intel one, resulting in the IOMMU violations. This seems to be some kind of firmware bug where the firmware isn't playing along with the MMU settings and Intel's driver. In yours, the cause may come from somewhere else; hopefully the devs can provide more guidance on what is triggering your case, but if sleep is involved, chances are that firmware is somehow part of the issue.
*** Bug 98728 has been marked as a duplicate of this bug. ***
*** Bug 99308 has been marked as a duplicate of this bug. ***
*** Bug 94780 has been marked as a duplicate of this bug. ***
*** Bug 99964 has been marked as a duplicate of this bug. ***
Created attachment 130657 [details]
/sys/class/drm/card0/error Intel HD Graphics 5500
Hi! I got bitten by this bug recently after enabling the IOMMU with the intel_iommu=on kernel command line option. It has happened to me once (so far): soon after resume from suspend my GPU hung.
I saw someone recommended disabling the EFI framebuffer. How exactly do you do this? Here is an excerpt from my dmesg:
dmesg | grep VGA
fb0: EFI VGA frame buffer device
fb: switching to inteldrmfb from EFI VGA
[drm] Replacing VGA console driver
DMESG log during hang:
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr faf16000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf53000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf4f000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr faf56000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf60000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf5b000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf18000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf1a000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf1c000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf2b000 [fault reason 05] PTE Write access is not set
[drm] GPU HANG: ecode 8:0:0x85dffffb, in Xorg [533], reason: Hang on render ring, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
drm/i915: Resetting chip after gpu hang
dmar_fault: 235 callbacks suppressed
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff267000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff283000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff251000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff291000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2af000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2bf000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2f0000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2bc000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2dc000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2ed000 [fault reason 05] PTE Write access is not set
drm/i915: Resetting chip after gpu hang
[drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
Disabling the EFI framebuffer in the kernel didn't help; 'intel_iommu=on,igfx_off' did.

Adding tag into "Whiteboard" field - ReadyForDev
The bug is still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included

*** Bug 101694 has been marked as a duplicate of this bug. ***

Chris, I did an experiment that makes intel_unmap a no-op. The old DMA region then stays mapped, and a newly allocated buffer always gets a new IOVA. With that change I could not reproduce this issue. Therefore, I guess the BDW GPU may still issue DMA transactions to the old, unmapped DMA region even after the workload has finished. I also always see the hang at PIPE_CTL.

--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3787,7 +3787,7 @@ static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
 	struct intel_iommu *iommu;
 	struct page *freelist;

-	if (iommu_no_mapping(dev))
+	//if (iommu_no_mapping(dev))
 		return;

 	domain = find_domain(dev);

BTW, sometimes the IOMMU fault addr is 48 bits wide, which looks more like a Gfx VA, but sometimes it is 33 bits or less, which looks more like an IOVA or physical address. Per my understanding, an IOMMU fault address should always be an IOVA, so how does a Gfx VA get recorded? Is the dedicated gfx IOMMU special?

Thanks,
Changbin.

(In reply to Du, Changbin from comment #58)
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3787,7 +3787,7 @@ static void intel_unmap(struct device *dev, dma_addr_t
> dev_addr, size_t size)
> struct intel_iommu *iommu;
> struct page *freelist;
>
> - if (iommu_no_mapping(dev))
> + //if (iommu_no_mapping(dev))
> return;
>

Our QA helped verify on both NUC and server platforms, and the result is interesting:
1. On the server with Iris Pro 6200 (Broadwell GT3e), we can still reproduce it with the above change.
2. On the NUC with HD Graphics 5500 (Broadwell GT2), we didn't see the DMAR error with the above change.
GT3e has 128MB eDRAM, while GT2 doesn't. So there is probably a problem in the gfx cache or memory.
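The observation above about fault address widths can be checked against a saved log with a few lines of Python. This is only a sketch: the regular expression is written against the DMAR message layout quoted in this report, and the helper name parse_dmar_faults is made up for illustration, not part of any kernel tooling.

```python
import re

# Matches only the DMAR fault layout seen in the logs of this report;
# other kernel versions may print these messages differently.
DMAR_FAULT_RE = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+) \[fault reason (?P<reason>\d+)\]"
)


def parse_dmar_faults(log_text):
    """Return one dict per DMAR fault line found in log_text (hypothetical helper)."""
    faults = []
    for m in DMAR_FAULT_RE.finditer(log_text):
        addr = int(m.group("addr"), 16)
        faults.append({
            "op": m.group("op"),
            "device": m.group("dev"),
            "addr": addr,
            # Number of significant bits in the faulting address.
            "addr_bits": addr.bit_length(),
            "reason": int(m.group("reason")),
        })
    return faults


# Two lines copied from logs quoted in this report.
sample = """\
DMAR: [DMA Read] Request device [00:02.0] fault addr faf18000 [fault reason 05] PTE Write access is not set
Aug 29 19:05:02 scapa kernel: [ 71.509230] DMAR: [DMA Write] Request device [00:02.0] fault addr 32ff4b000 [fault reason 23] Unknown
"""

for fault in parse_dmar_faults(sample):
    print(f"{fault['op']:5} {fault['device']} addr=0x{fault['addr']:x} "
          f"bits={fault['addr_bits']} reason={fault['reason']}")
```

Running this over a complete dmesg would show how many faults fall above or below a given width, which is the distinction between Gfx VA and IOVA raised above.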
*** Bug 101785 has been marked as a duplicate of this bug. ***

*** Bug 100203 has been marked as a duplicate of this bug. ***

(In reply to Elizabeth from comment #61)
> *** Bug 100203 has been marked as a duplicate of this bug. ***

From bug 100203:
(In reply to stsp from comment #3)
> There are a few similar reports in this bugzilla,
> but please note that my dmesg ends with Oops. So
> probably it gives more info.

*** Bug 101236 has been marked as a duplicate of this bug. ***

*** Bug 101238 has been marked as a duplicate of this bug. ***

*** Bug 100209 has been marked as a duplicate of this bug. ***

Happens to me too when booting 4.13.3 (also .2).

Hardware is a ThinkPad T450s with the latest BIOS update:
DMI: LENOVO 20BWS10N00/20BWS10N00, BIOS JBET65WW (1.29 ) 06/15/2017
i7-5600U (Broadwell)

Booting 4.13.3-1-ARCH hangs when starting Xorg (sddm). The message I caught via ssh in the first second before it hung completely was:

[ 123.917760] DMAR: DRHD: handling fault status reg 2
[ 123.917765] DMAR: [DMA Write] Request device [00:02.0] fault addr 108a000 [fault reason 23] Unknown

4.12.13-1-ARCH works fine
4.9.51-1-lts works fine too

Distro is Arch Linux. Booting to the CLI works fine too. Booting with "intel_iommu=on,igfx_off" on the kernel command line also works fine.

$ zgrep IOMMU /proc/config.gz
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_IOMMU_HELPER=y
CONFIG_VFIO_IOMMU_TYPE1=m
# CONFIG_VFIO_NOIOMMU is not set
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
# Generic IOMMU Pagetable Support
CONFIG_IOMMU_IOVA=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_V2=m
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_SVM=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set

(PS: by mistake I added this comment to a closed/duplicate bug of this one. Sorry about that.)

When I set the i915.enable_execlists=0 kernel option, it doesn't hang after resuming from suspend. It spams the log with these messages instead:
[drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle

Patching kernel with workaround posted in https://bugs.freedesktop.org/show_bug.cgi?id=89360#c58 fixes this issue for Intel HD Graphics 5500 (Broadwell GT2) and linux 4.13.3.

This bug has affected many people on Arch on a variety of devices since a maintainer switched on CONFIG_INTEL_IOMMU_DEFAULT_ON in kernel 4.13.x. See the bug report: https://bugs.archlinux.org/task/55629

(In reply to Ernest Hurtado from comment #68)
> Patching kernel with workaround posted in
> https://bugs.freedesktop.org/show_bug.cgi?id=89360#c58 fixes this issue for
> Intel HD Graphics 5500 (Broadwell GT2) and linux 4.13.3.

It fixed the GPU issue but caused cpu leaks with the network stack.

Created attachment 134746 [details]
dmesg caught after GPU hang/reset with 4.13.3
Created attachment 134747 [details]
DRM error dump caught after GPU hang/reset with 4.13.3
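Several reports above depend on which combination of CONFIG_INTEL_IOMMU_DEFAULT_ON and intel_iommu= options was in effect. The following sketch models that combination in a deliberately simplified way: it only understands the on, off and igfx_off options mentioned in this thread, and both the function name igfx_remapping_expected and its default_on parameter are hypothetical names chosen for this example.

```python
def igfx_remapping_expected(cmdline, default_on=False):
    """Simplified model: would DMA remapping apply to the integrated GPU?

    cmdline    -- kernel command line, e.g. the contents of /proc/cmdline
    default_on -- assumed to mirror CONFIG_INTEL_IOMMU_DEFAULT_ON=y
    Only 'on', 'off' and 'igfx_off' are handled; other options are ignored.
    """
    enabled = default_on
    igfx_mapped = True
    for token in cmdline.split():
        if not token.startswith("intel_iommu="):
            continue
        for opt in token[len("intel_iommu="):].split(","):
            if opt == "on":
                enabled = True
            elif opt == "off":
                enabled = False
            elif opt == "igfx_off":
                igfx_mapped = False
    return enabled and igfx_mapped


# Configurations mentioned in this thread.
print(igfx_remapping_expected("quiet intel_iommu=on"))
print(igfx_remapping_expected("quiet intel_iommu=on,igfx_off"))
print(igfx_remapping_expected("quiet", default_on=True))
```

On a running system the first argument could be read from /proc/cmdline; whether the kernel was built with the default-on option still has to be checked separately, for example with the zgrep command shown earlier.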
*** Bug 104802 has been marked as a duplicate of this bug. ***

*** Bug 104929 has been marked as a duplicate of this bug. ***

First of all. Sorry about spam.
This is mass update for our bugs.

Sorry if you find this annoying, but with this we are trying to understand whether the bug is still valid or not.
If bug investigation still in progress, please ignore this and I apologize!

If you think this is not anymore valid, please comment to the bug that can be closed.
If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug.

(In reply to Jani Saarinen from comment #75)
> First of all. Sorry about spam.
> This is mass update for our bugs.
>
> Sorry if you find this annoying, but with this we are trying to understand
> whether the bug is still valid or not.
> If bug investigation still in progress, please ignore this and I apologize!
>
> If you think this is not anymore valid, please comment to the bug that can
> be closed.
> If you haven't tested with our latest pre-upstream tree(drm-tip), can you do
> that also to see if issue is valid there still and if you cannot see issue
> there, please comment to the bug.

not for me
regards

Do you mean not seeing issue anymore?

Sorry, we have been using intel_iommu=igfx_off on all our configurations since this bug.
regards

*** Bug 105823 has been marked as a duplicate of this bug. ***

Created attachment 138682 [details]
crash dump of Intel Iris Graphics 6100
Description of problem:
GPU hang when launching "systemctl start graphical.target" with kernel flag "intel_iommu=on"
Version-Release number of selected component (if applicable):
xorg-x11-drv-intel-2.99.917-31.20171025.fc27.x86_64
gnome-shell-3.26.2-4.fc27.x86_64
kernel-4.15.14-300.fc27.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Boot with kernel option 'intel_iommu=on'.
2. GPU will hang at the start of GDM login screen.
Actual results:
dmesg:
[ 354.039739] [drm] GPU HANG: ecode 8:-1:0x00000000, reason: Kicking stuck wait on rcs0, action: continue
[ 354.039740] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 354.039740] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 354.039741] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 354.039741] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 354.039742] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Expected results:
The GPU does not hang.
Additional info:
Hardware: Apple MacBook Pro Early 2015 13-inch
GPU: Intel Iris Graphics 6100
Distro: Fedora Workstation 27 x86_64
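When comparing hang reports like the one above with the earlier logs, it can help to pull the fields out of the "GPU HANG" line. Here is a minimal sketch, assuming the two message layouts quoted in this report (with and without the "in <process> [<pid>]" part); the function name parse_gpu_hang is hypothetical, and splitting the ecode on ':' is only a convenience, since the meaning of the individual ecode fields is not stated in this thread.

```python
import re

# Written against the two "GPU HANG" layouts quoted in this report;
# the "in <process> [<pid>]" part is optional.
GPU_HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>\S+?),"
    r"(?: in (?P<process>\S+) \[(?P<pid>\d+)\],)?"
    r" reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a GPU HANG line as a dict, or None if no match."""
    m = GPU_HANG_RE.search(line)
    if m is None:
        return None
    return {
        "ecode": m.group("ecode"),
        # Raw colon-separated parts; their interpretation is left open here.
        "ecode_parts": m.group("ecode").split(":"),
        "process": m.group("process"),
        "pid": int(m.group("pid")) if m.group("pid") else None,
        "reason": m.group("reason"),
        "action": m.group("action"),
    }


lines = [
    "[drm] GPU HANG: ecode 8:0:0x85dffffb, in Xorg [533], reason: Hang on render ring, action: reset",
    "[ 354.039739] [drm] GPU HANG: ecode 8:-1:0x00000000, reason: Kicking stuck wait on rcs0, action: continue",
]

for line in lines:
    print(parse_gpu_hang(line))
```

The two sample lines are copied from this report; comparing the reason and action fields makes it easier to see that the Fedora case ("Kicking stuck wait", action "continue") looks different from the render-ring hangs with action "reset".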
(In reply to Jani Saarinen from comment #75)
> If you haven't tested with our latest pre-upstream tree(drm-tip), can you do
> that also to see if issue is valid there still and if you cannot see issue
> there, please comment to the bug.

I haven't tested on drm-tip but the problem still happens on 4.15 with HD 5500

Chris, do you see testing drm-tip helps here?

Reporter, can you test with latest drm-tip that is now on 4.17.0-rc5

(In reply to Jani Saarinen from comment #83)
> Reporter, can you test with latest drm-tip that is now on 4.17.0-rc5

I just tried and after a while (system docked with external screen configured, and some vtswitch because lightdm) I had a freeze. Looking at kern.log I have:

May 17 18:01:22 scapa kernel: [14619.773523] [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe C FIFO underrun
May 17 18:01:25 scapa kernel: [14622.309074] [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe B FIFO underrun

Please attach whole log too so send dmesg with drm.debug=0x1e log_buf_len=4M?

(In reply to Jani Saarinen from comment #85)
> Please attach whole log too so send dmesg with drm.debug=0x1e log_buf_len=4M?

Unfortunately it's not really reproducible at will, and I assume booting with that options will render the system really slow and barely usable?

(In reply to Yves-Alexis from comment #86)
> Unfortunately it's not really reproducible at will,

Then I wonder why the bugs like
https://bugs.freedesktop.org/show_bug.cgi?id=100203
https://bugs.freedesktop.org/show_bug.cgi?id=94959
and all the other 100%-reproducible bugs were
marked as a duplicate of this one... :(

(In reply to stsp from comment #87)
> (In reply to Yves-Alexis from comment #86)
> > Unfortunately it's not really reproducible at will,
>
> Then I wonder why the bugs like
> https://bugs.freedesktop.org/show_bug.cgi?id=100203
> https://bugs.freedesktop.org/show_bug.cgi?id=94959
> and all the other 100%-reproducible bugs were
> marked as a duplicate of this one...
> :(

I can't really comment on the other bug. For me, I can't trigger it right away, but since 3 years now every time I try to remove igfx_off the systems ends up freezing after a while. I try to test latest branches and provide logs, but honestly nothing really changed since 2015.

On latest drm-tip with an i7-6560U the problem only occurs if the GUC is enabled.

After upgrade to 4.18rc I can't reproduce crash anymore.

Marking resolved based on the last two comments. Before reopening, please make sure you can reproduce with the latest drm-tip.

Closing, as latest feedback as of two months ago and three weeks ago, is that this is now working.

I just had a chance to test 4.18 and I have to say it's *not* fixed. Maybe it's a different bug, but in any case I had a “soft” freeze with following message in dmesg:

Aug 29 19:04:17 scapa kernel: [ 26.943249] DMAR: DRHD: handling fault status reg 3
Aug 29 19:04:17 scapa kernel: [ 26.943255] DMAR: [DMA Read] Request device [00:02.0] fault addr 4600000 [fault reason 23] Unknown
Aug 29 19:04:17 scapa kernel: [ 26.943259] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:17 scapa kernel: [ 26.943262] DMAR: [DMA Read] Request device [00:02.0] fault addr 4613000 [fault reason 23] Unknown
Aug 29 19:04:17 scapa kernel: [ 26.943264] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:17 scapa kernel: [ 26.943267] DMAR: [DMA Read] Request device [00:02.0] fault addr 461b000 [fault reason 23] Unknown
Aug 29 19:04:17 scapa kernel: [ 26.943269] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:24 scapa kernel: [ 33.831279] [drm] GPU HANG: ecode 8:0:0x85dffffb, in Xorg [1028], reason: hang on rcs0, action: reset
Aug 29 19:04:24 scapa kernel: [ 33.831280] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Aug 29 19:04:24 scapa kernel: [ 33.831281] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Aug 29 19:04:24 scapa kernel: [ 33.831281] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Aug 29 19:04:24 scapa kernel: [ 33.831282] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Aug 29 19:04:24 scapa kernel: [ 33.831282] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Aug 29 19:04:24 scapa kernel: [ 33.831298] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Aug 29 19:04:24 scapa kernel: [ 33.838481] dmar_fault: 53 callbacks suppressed
Aug 29 19:04:24 scapa kernel: [ 33.838482] DMAR: DRHD: handling fault status reg 3
Aug 29 19:04:24 scapa kernel: [ 33.838487] DMAR: [DMA Write] Request device [00:02.0] fault addr 4641000 [fault reason 23] Unknown
Aug 29 19:04:32 scapa kernel: [ 41.824158] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Aug 29 19:04:32 scapa kernel: [ 41.824723] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:32 scapa kernel: [ 41.824729] DMAR: [DMA Write] Request device [00:02.0] fault addr 17f000 [fault reason 23] Unknown
Aug 29 19:04:40 scapa kernel: [ 49.813478] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Aug 29 19:04:48 scapa kernel: [ 57.804899] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Aug 29 19:04:56 scapa kernel: [ 65.799728] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Aug 29 19:04:56 scapa kernel: [ 65.882208] wlan0: deauthenticating from 14:0c:76:bf:71:fc by local choice (Reason: 3=DEAUTH_LEAVING)
Aug 29 19:04:56 scapa kernel: [ 65.902446] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
Aug 29 19:04:57 scapa kernel: [ 66.510770] DMAR: DRHD: handling fault status reg 3
Aug 29 19:04:57 scapa kernel: [ 66.510778] DMAR: [DMA Write] Request device [00:02.0] fault addr fffc6000 [fault reason 23] Unknown
Aug 29 19:04:57 scapa kernel: [ 66.510781] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:57 scapa kernel: [ 66.510784] DMAR: [DMA Write] Request device [00:02.0] fault addr 4d000 [fault reason 23] Unknown
Aug 29 19:04:57 scapa kernel: [ 66.510788] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:57 scapa kernel: [ 66.510791] DMAR: [DMA Write] Request device [00:02.0] fault addr 51000 [fault reason 23] Unknown
Aug 29 19:04:57 scapa kernel: [ 66.510802] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:02 scapa kernel: [ 71.509221] dmar_fault: 6733586 callbacks suppressed
Aug 29 19:05:02 scapa kernel: [ 71.509222] DMAR: DRHD: handling fault status reg 3
Aug 29 19:05:02 scapa kernel: [ 71.509230] DMAR: [DMA Write] Request device [00:02.0] fault addr 32ff4b000 [fault reason 23] Unknown
Aug 29 19:05:02 scapa kernel: [ 71.509233] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:02 scapa kernel: [ 71.509236] DMAR: [DMA Write] Request device [00:02.0] fault addr 32ff53000 [fault reason 23] Unknown
Aug 29 19:05:02 scapa kernel: [ 71.509239] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:02 scapa kernel: [ 71.509241] DMAR: [DMA Write] Request device [00:02.0] fault addr 32ff57000 [fault reason 23] Unknown
Aug 29 19:05:02 scapa kernel: [ 71.509244] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:07 scapa kernel: [ 76.511769] dmar_fault: 6751341 callbacks suppressed
Aug 29 19:05:07 scapa kernel: [ 76.511770] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:07 scapa kernel: [ 76.511775] DMAR: [DMA Write] Request device [00:02.0] fault addr 66e6bc000 [fault reason 23] Unknown
Aug 29 19:05:07 scapa kernel: [ 76.511778] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:07 scapa kernel: [ 76.511781] DMAR: [DMA Write] Request device [00:02.0] fault addr 66e6c3000 [fault reason 23] Unknown
Aug 29 19:05:07 scapa kernel: [ 76.511784] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:07 scapa kernel: [ 76.511787] DMAR: [DMA Write] Request device [00:02.0] fault addr 66e6c8000 [fault reason 23] Unknown
Aug 29 19:05:07 scapa kernel: [ 76.511790] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:12 scapa kernel: [ 81.514717] dmar_fault: 6802731 callbacks suppressed
Aug 29 19:05:12 scapa kernel: [ 81.514718] DMAR: DRHD: handling fault status reg 3
Aug 29 19:05:12 scapa kernel: [ 81.514722] DMAR: [DMA Write] Request device [00:02.0] fault addr 9ade03000 [fault reason 23] Unknown
Aug 29 19:05:12 scapa kernel: [ 81.514725] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:12 scapa kernel: [ 81.514728] DMAR: [DMA Write] Request device [00:02.0] fault addr 9ade0a000 [fault reason 23] Unknown
Aug 29 19:05:12 scapa kernel: [ 81.514731] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:12 scapa kernel: [ 81.514733] DMAR: [DMA Write] Request device [00:02.0] fault addr 9ade0e000 [fault reason 23] Unknown
Aug 29 19:05:12 scapa kernel: [ 81.514736] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:12 scapa kernel: [ 81.794708] i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0
Aug 29 19:05:20 scapa kernel: [ 89.793873] i915 0000:00:02.0: Resetting chip for hang on rcs0
Aug 29 19:05:20 scapa kernel: [ 89.793938] i915 0000:00:02.0: GPU recovery failed

Unfortunately because of the soft freeze I didn't have a chance to recover /sys/class/drm/card0/error. But the hang happened pretty soon after boot on my broadwell CPU, pretty much as soon as I enabled the external screen when logged on the desktop.

Hello, I can also confirm that I am getting this on 4.18.5. I reproduce it somewhat consistently when I suspend my laptop and then resume it.

I managed to reproduce again (same MO), and this time I managed to get /sys/class/drm/card0/error. Before posting it here, can someone confirm it doesn't have any personal information in there (there is some binary in there so I'd prefer some confirmation).

Workaround for gpu hangs that comes with DMAR ERROR is to add intel_iommu=igfx_off kernel option.

IF gpu hangs for some other reason please file a new bug if there isn't a existing bug.

Closing this bug, as the original issue is resolved with a work around fix.
Please create a new bug for GPU hangs (other than the DMAR ERROR case).
When you create it, ensure that the issue occurs with the latest drm-tip (https://cgit.freedesktop.org/drm-tip).
Attach the full dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M.

(In reply to Lakshmi from comment #96)
> Workaround for gpu hangs that comes with DMAR ERROR is to add
> intel_iommu=igfx_off kernel option.

Hi Lakshmi, we have known about igfx_off for a long time, but it's not a fix, it's a workaround. We were told the bug was *fixed* in recent kernels (4.18+), but it doesn't seem to be the case, at least on Broadwell.

> Closing this bug, as the original issue is resolved with a work around fix.

I have to admit I'm disappointed by this. Not surprised though, I was kind of expecting this; it just took quite a lot of time to finally admit there would be no software fix. I was told the same for my i7-640LM. Is there a chance DMAR will one day work fine with the iGPU, or should we enforce igfx_off by default in the Linux kernel?

(In reply to Lakshmi from comment #96)
> Workaround for gpu hangs that comes with DMAR ERROR is to add
> intel_iommu=igfx_off kernel option.

Hello Lakshmi,

While use of intel_iommu=igfx_off works around the GPU freeze issues, it has major drawbacks. For example it makes Intel VT-d _unusable_ with VirtualBox. Also, stating that a workaround is a fix for a bug is a contradiction IMO. I still consider that this bug is present and appeal to either keep the issue open until it has been fixed and the fix has been validated, or state outright that there will be no fix and VT-d should be considered buggy on Broadwell.

(In reply to Lakshmi from comment #96)
> Workaround for gpu hangs that comes with DMAR ERROR is to add
> intel_iommu=igfx_off kernel option.
>
> IF gpu hangs for some other reason please file a new bug if there isn't a
> existing bug.
>
> Closing this bug, as the original issue is resolved with a work around fix.

Unbelievable to hear such a thing from an Intel employee. Please refer to the docs:

https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt

```
Graphics Problems?
------------------
If you encounter issues with graphics devices, you can try adding
option intel_iommu=igfx_off to turn off the integrated graphics engine.
If this fixes anything, please ensure you file a bug reporting the problem.
```

"to turn off the integrated graphics engine" is what the doc says.
"please ensure you file a bug" is what the doc says.

This should be re-opened.

Before Linux 4.18 I could reproduce this bug on Broadwell and Skylake hardware. On Linux 4.18 I can't reproduce this on Skylake anymore (after months of testing). I can't test on Broadwell as my machine didn't live as long as this bug. I don't know which exact commits fixed this issue on Skylake and why it doesn't work on Broadwell as other people reported.

BTW: on Linux 4.19rc1-3 iommu is broken again for graphics, but this is an unrelated issue which is being worked on.

I can confirm that the problem still affects Intel Broadwell 5500 (gen 8), which needs the workaround in kernel 4.19.2 (using gentoo-sources and vanilla-sources in Gentoo Linux) to start "gdm" without suffering an Intel GPU hang.

dmesg clearly shows that the problem with DMAR hangs the Intel GPU.

(In reply to Ernest Hurtado from comment #101)
> Before Linux 4.18 I could reproduce this bug on Broadwell and Skylake
> hardware. On Linux 4.18 I can't reproduce this on Skylake anymore (after
> months of testing). I can't test on Broadwell as my machine didn't lived as
> long as this bug. I don't know which exact commits fixed this issue on
> Skylake and why it doesn't work on Broadwell as other people reported.
>
> BTW: on Linux 4.19rc1-3 iommu is broken again for graphics but this is
> unrelated issue which is being worked on.

Hi, Ernest.

At least in Gentoo Linux, I did not have the problem with Broadwell in kernel 4.18 and prior versions since 4.14.x. The machine is a Dell Inspiron 5540 laptop with Intel integrated graphics and a discrete AMD Topaz GPU. To get both providers working with PRIME, I had to have a conf file in /etc/X11/xorg.d/ declaring DRI "3" for the Intel xf86 video driver.

The hang problem with the Intel GPU has appeared for me since the 4.19.1 linux kernel. Now, I can get "gdm" working without any specific xorg driver conf file in /etc/X11/xorg.d/ and with the workaround "intel_iommu=igfx_off". I can use both graphics cards with PRIME, although the "modesetting" xf86 video driver doesn't work yet and I have to keep the old Intel xf86 video driver.

This behaviour with previous kernels makes me suspect that the iommu breakage in 4.19rc1-3 you mention has something to do with the problem in Broadwell (at least in my laptop).

Miguel Ángel

(In reply to miguelramos from comment #103)
> This behaviour with previous kernels makes me suspect that the iommu
> breakage in 4.19rc1-3 you mention has something to do with the problem in
> Broadwell (at least in my laptop).
>
> Miguel Ángel

In the issue I mentioned, the display wasn't working at all at boot (black screen), which is fixed in the 4.19 stable release, so I think this wasn't related to your problems.

Created attachment 142592 [details]
error file after DMAR error hangs GPU trying to start "gdm" with 4.19.3
This is the /sys/class/drm/card0/error file I get after GPU reset fails trying to start gnome-shell gdm
Created attachment 142593 [details]
DMAR error in journal trying to start gnome-shell gdm
The attachment here contains the part of the journal showing the DMAR error when the gentoo-linux 4.19.3 boots without the "intel_iommu=igfx_off" option and I try to start gdm.
(In reply to Ernest Hurtado from comment #104)
> (In reply to miguelramos from comment #103)
> > This behaviour with previous kernels makes me suspect that the iommu
> > breakage in 4.19rc1-3 you mention has something to do with the problem in
> > Broadwell (at least in my laptop).
> >
> > Miguel Ángel
>
> In the issue I mentioned the display wasn't working at all at boot (black
> screen), which is fixed in 4.19 stable release so I think this wasn't
> related to your problems.

You are right, Ernest. And because of that, I conjecture that the problem that appears in my case for the first time in kernel 4.19 might be related to the specific changes implemented in the IOMMU support in kernel 4.19.

M. A.

*** Bug 107921 has been marked as a duplicate of this bug. ***

miguelramos, do you still see the issue with kernel 4.19.5 or later?

Ping miguelramos?

Created attachment 143757 [details]
attachment-26232-0.html

Francesco,
I think this is an HW issue; apart from setting intel_iommu=igfx_off we can’t do much about it.
A few from the community are not happy with that and then reopened the issue. Joonas might know more details.
At the moment, we can’t do much about this bug, just letting you know the background☺

Lakshmi.

(In reply to miguelramos from comment #102)
> I can confirm that the problem still affects Intel Broadwell 5500 (gen 8)
> that needs the workaround in kernel 4.19.2 using gentoo-sources and
> vanilla-sources in Gentoo Linux to start "gdm" without suffering a Intel GPU
> hang.
>
> dmesg clearly shows that the problem with DMAR hangs the Intel GPU.

So I tried removing igfx_off in the (Debian) 4.19 kernel (BDW) and it sure still doesn't work. It might be a different bug than before but I still have to use that option.

Was that with kernel 4.19.5 or later?

(In reply to Francesco Balestrieri from comment #113)
> Was that with kernel 4.19.5 or later?

My last test was with 4.19.28

(In reply to Francesco Balestrieri from comment #109)
> miguelramos, do you still the issue with kernel 4.19.5 or later?

Yes, exactly the same with kernel 5.0.10.

As soon as gdm is started, the laptop hangs completely. I still need to pass intel_iommu=igfx_off to work around it.

But I don't know how to get the updated logs... because the system hangs completely as soon as I try to launch GDM and I hit the bug :/

Hello,

I have

description: Notebook
product: LIFEBOOK E8420
vendor: FUJITSU SIEMENS
version: E84__
width: 32 bits
capabilities: smbios-2.4 dmi-2.4

with

00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])
Subsystem: Fujitsu Limited.
Mobile 4 Series Chipset Integrated Graphics Controller
Flags: bus master, fast devsel, latency 0, IRQ 16
Memory at f2000000 (64-bit, non-prefetchable) [size=4M]
Memory at d0000000 (64-bit, prefetchable) [size=256M]
I/O ports at 1800 [size=8]
[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
Capabilities: <access denied>
Kernel driver in use: i915
Kernel modules: i915

00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
Subsystem: Fujitsu Limited. Mobile 4 Series Chipset Integrated Graphics Controller
Flags: bus master, fast devsel, latency 0
Memory at f2400000 (64-bit, non-prefetchable) [size=1M]
Capabilities: <access denied>

running

4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:12 UTC 2019 i686 i686 i686 GNU/Linux

My system reproducibly turns into a brick every 2nd boot when switching to inteldrmfb from VESA ?VGA? or so.
Sometimes the CLI screen gets a bit cluttered or blacks out, or the machine just heats up the room.

I also noticed:
KiCAD schematics get cluttered with the mouse cross until refresh.
I cannot use a secondary display via HDMI and xrandr.

I tried to find some logs, but since it happens during bootup I did not find anything to investigate further.

If I can help to investigate / fix the annoying shit, please let me know what / how to do, as I really would like to use my big stationary screen for CAD.

Sincerely Reiner

(In reply to Toroid from comment #117)
> Hello,
>
> i have
>
> description: Notebook
> product: LIFEBOOK E8420
> vendor: FUJITSU SIEMENS
> version: E84__
> width: 32 bits
> capabilities: smbios-2.4 dmi-2.4
>
> with
>
> 00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset
> Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])
> Subsystem: Fujitsu Limited. Mobile 4 Series Chipset Integrated Graphics
> Controller
> Flags: bus master, fast devsel, latency 0, IRQ 16
> Memory at f2000000 (64-bit, non-prefetchable) [size=4M]
> Memory at d0000000 (64-bit, prefetchable) [size=256M]
> I/O ports at 1800 [size=8]
> [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
> Capabilities: <access denied>
> Kernel driver in use: i915
> Kernel modules: i915
>
> 00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset
> Integrated Graphics Controller (rev 07)
> Subsystem: Fujitsu Limited. Mobile 4 Series Chipset Integrated Graphics
> Controller
> Flags: bus master, fast devsel, latency 0
> Memory at f2400000 (64-bit, non-prefetchable) [size=1M]
> Capabilities: <access denied>
>
>
> running
>
> 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:12 UTC 2019 i686 i686 i686
> GNU/Linux
>
> My System turns reproducable into a piece of brick every 2nd boot when
> switching to inteldrmfb from VESA ?VGA? or so.
> Sometimes the CLI Screen gets a bit cluttered or blacks out or just heats up
> the room.
>
> I also noticed:
> KiCAD schematics gets cluttered with mousecross until refresh.
> I can not use a secondary display by hdmi and xrandr.
>
> I tried to find some loggings but since it happens during bootup i did not
> find anything to investigate further details.
>
> If i can support to investigate / fix the annoying shit please let me know
> what / how to do as i really would like to use my big stationary screen for
> cad.
>
> Sincerely Reiner

Hi,

This looks like a different issue than the original bug report. Can you please verify the issue with drm-tip (https://cgit.freedesktop.org/drm-tip)? If the problem persists, create a new bug and attach the dmesg log from boot with kernel parameters drm.debug=0x1e log_buf_len=4M.

Thanks, but to me this does not look like anything I can follow up on, check or change, or like information that would be handy in any way.
I am from networking and project management and not at all a coder, except for some shell and C, Fortran ....
Doing some system configuration is still fine with me, but not reconfiguring a package and maybe locking up my system, which I really need.

I will just try with the kernel dmesg => log to see if I can grab some logging.
Blacklisting inteldrmfb doesn't help, by the way; the system still locks up with a black screen.

Created attachment 144746 [details]
attachment-15805-0.html
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/21.
Created attachment 113867 [details]
kernel log

The system randomly freezes and doesn't react to anything afterwards. Not even the magic keys can reboot the system. Processor model is Intel Core-i7 5500U with the integrated GPU. Kernel version is 4.0.0-rc1, which is required to even get X / gdm working with the system. I've attached the kernel log messages, which show an instance of this problem. Please request any information needed and I'll happily provide it.

steveeJ