Bug 89360 - [bdw-u iommu] DMAR error -> GPU hang
Summary: [bdw-u iommu] DMAR error -> GPU hang
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: All Linux (All)
Importance: high critical
Assignee: Joonas Lahtinen
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 90091 90823 91076 91152 91458 91633 91764 92531 92905 94229 94780 94959 98309 98728 99308 99964 100203 100209 101236 101238 101694 101785 104802 104929 105823 107921
Depends on:
Blocks:
 
Reported: 2015-02-27 13:45 UTC by Stefan Junker
Modified: 2019-11-29 17:12 UTC
54 users

See Also:
i915 platform: BDW
i915 features: GEM/Other, GPU hang


Attachments
kernel log (379.31 KB, text/plain)
2015-02-27 13:45 UTC, Stefan Junker
bpowers' dmesg (79.16 KB, text/plain)
2015-03-04 14:40 UTC, Bobby Powers
bpowers' GPU crash dump (2.66 MB, text/plain)
2015-03-04 14:41 UTC, Bobby Powers
kernel log on 4.2.2 (122.76 KB, text/plain)
2015-10-02 13:36 UTC, Yves-Alexis
kernel log on 4.2.2 with execlists disabled (69.50 KB, text/plain)
2015-10-02 13:38 UTC, Yves-Alexis
Resetting chip after gpu hang (4.85 KB, text/plain)
2016-03-21 20:12 UTC, crow
Kernel log with i915 hang (4.43 KB, text/plain)
2016-04-06 18:36 UTC, Sergi Barroso
Error from intel card (377.59 KB, text/plain)
2016-04-06 18:37 UTC, Sergi Barroso
drm_error crash dump of brots ThinkPad X250 (272.18 KB, text/plain)
2016-10-06 09:04 UTC, Michael Groh
/sys/class/drm/card0/error Intel HD Graphics 5500 (714.46 KB, text/x-log)
2017-04-03 20:53 UTC, Ernest Hurtado
dmesg caught after GPU hang/reset with 4.13.3 (4.95 KB, text/plain)
2017-10-08 12:55 UTC, Ansgar Hegerfeld
DRM error dump caught after GPU hang/reset with 4.13.3 (38.64 KB, text/plain)
2017-10-08 12:56 UTC, Ansgar Hegerfeld
crash dump of Intel Iris Graphics 6100 (28.53 KB, text/plain)
2018-04-08 12:55 UTC, Giovanni Grieco
error file after DMAR error hangs GPU trying to start "gdm" with 4.19.3 (37.84 KB, text/plain)
2018-11-23 13:15 UTC, miguelramos
DMAR error in journal trying to start gnome-shell gdm (5.69 KB, text/plain)
2018-11-23 13:49 UTC, miguelramos
attachment-26232-0.html (7.91 KB, text/html)
2019-03-22 09:09 UTC, Lakshmi
attachment-15805-0.html (2.97 KB, text/html)
2019-07-09 14:48 UTC, yunying sun

Description Stefan Junker 2015-02-27 13:45:30 UTC
Created attachment 113867 [details]
kernel log

The system randomly freezes and doesn't react to anything afterwards. Not even the magic keys can reboot the system.

Processor model is Intel Core-i7 5500U with the integrated GPU.
Kernel version is 4.0.0-rc1, which is required to even get X / gdm working with the system.

I've attached the kernel log messages which shows an instance of this problem.
Please request any information needed and I'll happily provide it.

steveeJ
Comment 1 Stefan Junker 2015-02-28 13:00:04 UTC
Short version
---
adding 'intel_iommu=igfx_off' helped


Long version
---
I've tried many things to resolve this issue, from kernel reconfiguration to installing mesa, libdrm and the intel drivers from the latest repository masters, none of which helped. I reverted back to the most recent releases of the packages.

From an older forum entry somewhere on the web I found that this could be related to virtualization technologies and memory remapping, so I added the following argument to my kernel command line: 'intel_iommu=igfx_off'

Ever since (about 10-15 hours of very active usage) I haven't had a single freeze.

I still think this is not normal behavior, since turning off iommu for the GPU can't be the right or necessary thing to do.
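
For anyone trying the workaround above, it can help to confirm that the option actually reached the running kernel. The following is only a minimal sketch in Python: the function names are made up for illustration, and it assumes the usual /proc/cmdline file and that intel_iommu= takes a comma-separated list of values (as in intel_iommu=on,igfx_off).

```python
from pathlib import Path
from typing import List


def intel_iommu_options(cmdline: str) -> List[str]:
    """Return the comma-separated values of every intel_iommu= argument."""
    options: List[str] = []
    for arg in cmdline.split():
        key, sep, value = arg.partition("=")
        if key == "intel_iommu" and sep:
            options.extend(v for v in value.split(",") if v)
    return options


def igfx_iommu_disabled(cmdline: str) -> bool:
    """True if the igfx_off workaround is present on the command line."""
    return "igfx_off" in intel_iommu_options(cmdline)


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel.
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        print("igfx_off active:", igfx_iommu_disabled(proc_cmdline.read_text()))
```

If this reports False after editing the bootloader configuration, the bootloader config was most likely not regenerated or the machine has not been rebooted yet.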
Comment 2 Bobby Powers 2015-03-04 14:40:12 UTC
Created attachment 113998 [details]
bpowers' dmesg
Comment 3 Bobby Powers 2015-03-04 14:41:09 UTC
Created attachment 113999 [details]
bpowers' GPU crash dump
Comment 4 Bobby Powers 2015-03-04 14:44:33 UTC
I'm also seeing this, on an i7-5600U.  Kernel 4.0.0-rc2+, xf86-video-intel from a few days ago (9fb8154), mesa 10.5-rc3.

I turned off VT-d in the BIOS, and haven't seen any issues since.  I've attached both my dmesg + GPU crash dump from one of the times this happened.
Comment 5 Bobby Powers 2015-03-04 14:56:33 UTC
some additional info - my BIOS (3rd gen x1 carbon) apparently marks x2apic as broken.  I booted a number of times with intremap=no_x2apic_optout on the kernel command line, and saw what steveej mentioned: a hard freeze.

The system did have the foresight to save the dmesg into the EFI pstore. I have those logs if they are useful.

After removing no_x2apic_optout, the kernel "Enabled IRQ remapping in xapic mode", and under xapic some of the time the kernel was able to recover/reset the chip to an ok-enough state that I could save dmesg and grab the GPU dump from /sys/class/drm/card0/error.
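
Since the machine may freeze again shortly after a recovered hang, it can be worth copying the error state out as soon as possible. A rough sketch, assuming Python is available and the script runs with enough privileges to read the sysfs file; the function name and the output file naming are purely illustrative:

```python
import shutil
import time
from pathlib import Path

# Location of the i915 error state mentioned above.
ERROR_STATE = Path("/sys/class/drm/card0/error")


def save_gpu_error_state(dest_dir: Path, source: Path = ERROR_STATE) -> Path:
    """Copy the GPU error state to a timestamped file and return its path."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / time.strftime("gpu-error-%Y%m%d-%H%M%S.txt")
    # Stream the contents rather than relying on the size reported for
    # the sysfs file, which need not match what a read returns.
    with source.open("rb") as src, dest.open("wb") as out:
        shutil.copyfileobj(src, out)
    return dest
```

Calling save_gpu_error_state(Path.home() / "gpu-dumps") right after a reset would leave a file that can be attached to the bug.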
Comment 6 Chris Wilson 2015-06-11 09:51:02 UTC
*** Bug 90823 has been marked as a duplicate of this bug. ***
Comment 7 Chris Wilson 2015-06-11 09:51:18 UTC
*** Bug 90091 has been marked as a duplicate of this bug. ***
Comment 8 Chris Wilson 2015-06-24 09:41:16 UTC
*** Bug 91076 has been marked as a duplicate of this bug. ***
Comment 9 Chris Wilson 2015-06-24 09:42:03 UTC
In bug 91076, the hangs are anything but random. Simply initialising execlists is enough for the GPU to die.
Comment 10 Chris Wilson 2015-07-01 07:43:15 UTC
*** Bug 91152 has been marked as a duplicate of this bug. ***
Comment 11 Chris Wilson 2015-07-01 07:43:53 UTC
Quick experiment from bug 91152 indicates that this is a problem with or without execlists enabled.
Comment 12 Chris Wilson 2015-07-25 12:01:43 UTC
*** Bug 91458 has been marked as a duplicate of this bug. ***
Comment 13 Maarten Lankhorst 2015-08-11 10:40:03 UTC
I've noticed the same problem here. I also get frequent DMAR errors without hangs during Ubuntu's fade-to-black animation for DPMS off.
Comment 14 Matthias Nagel 2015-08-11 17:41:36 UTC
I see the same problem as in comment #4: if I disable VT-d in the BIOS, the crashes disappear. But then I get random segmentation faults from GCC when I try to compile QtWebKit (N.B.: I run Gentoo and compile all packages myself). Hence, I have two options:

(1) Disable VT-d for daily work such that i915 does not crash
(2) Enable VT-d and only boot into text console mode if I need to compile QtWebKit

My hardware is

Lenovo Thinkpad X1 Carbon (3rd generation, type 20BT)
Processor Intel Core i5-5200U
Memory: 8 GB PC3-12800L
BIOS: Phoenix, ver. 1.09, 7/22/2015

Interestingly, MemTest86 does not report any memory error in either case (with and without VT-d)
Comment 15 Maarten Lankhorst 2015-08-12 08:54:36 UTC
Sounds more like an actual corruption prevented by DMAR then..
Comment 16 Chris Wilson 2015-08-14 10:35:05 UTC
*** Bug 91633 has been marked as a duplicate of this bug. ***
Comment 17 Chris Wilson 2015-08-27 15:14:16 UTC
*** Bug 91764 has been marked as a duplicate of this bug. ***
Comment 18 siflfran 2015-08-27 16:56:09 UTC
I'm not sure if 91764 really is a duplicate of this bug. intel_iommu=igfx_off doesn't help in my case.
Comment 19 klondike 2015-09-16 15:48:15 UTC
In my case this issue (googling for the opcode hanging the GPU led me to this bug) was solved by disabling the EFI framebuffer in the kernel configuration.

If the devs want, I can open a second bug to request that the Intel GFX driver take over from early framebuffers (for example EFI or VGA) to prevent my particular issue.
Comment 20 David Woodhouse 2015-10-02 12:36:33 UTC
Looking at the first two dumps, this looks like it might be a simple driver bug. The driver forgets to use the DMA API and wrongly just hands a physical address to the device. The device does DMA to that invalid address, takes a well-deserved fault, and is subsequently unhappy.

The faulting addresses do not look like addresses which would be given out as virtual DMA addresses by the DMA API. Such addresses would typically start at 0xfffff000 and grow downwards.
Comment 21 Yves-Alexis 2015-10-02 13:36:45 UTC
Created attachment 118605 [details]
kernel log on 4.2.2

I reported bug #90091 initially, which was marked as duplicate of this one.

I've tried to reproduce with 4.2.2, and it still happens. Two dmesgs are attached, the second one with i915.enable_execlists=0. In the latter case I only have the DMAR fault but not the GPU hang.
Comment 22 Yves-Alexis 2015-10-02 13:38:58 UTC
Created attachment 118606 [details]
kernel log on 4.2.2 with execlists disabled

Second log, with execlists disabled. Actually in the meantime I managed to get a GPU hang.
Comment 23 Yves-Alexis 2015-10-02 13:40:43 UTC
Offending userland code looks like chromium, which I guess uses the DDX intensively.
Comment 24 Chris Wilson 2015-10-19 10:14:47 UTC
*** Bug 92531 has been marked as a duplicate of this bug. ***
Comment 25 Josh Glover 2015-11-17 08:39:44 UTC
I have the same problem, and Google Chrome (as in the binary release, not Chromium) seems to trigger it. Disabling 3D rendering in X has helped, as now I get a crash and hang every couple of weeks instead of around once a day.

Here's my uname output, in case it helps:

Linux laurana 4.0.5-gentoo #5 SMP Tue Sep 22 09:45:32 CEST 2015 x86_64 Intel(R) Core(TM) i7-5600U CPU @ 2.60GHz GenuineIntel GNU/Linux

I'll try the 'intel_iommu=igfx_off' kernel option and see if that improves matters. Failing that, I'll disable VT-d, unless it is required by Docker.

If I can provide any helpful info, please let me know and I'll be happy to do so.
Comment 26 cprigent 2015-11-24 17:44:42 UTC
*** Bug 92905 has been marked as a duplicate of this bug. ***
Comment 27 Yves-Alexis 2016-01-11 16:31:33 UTC
Is there any information you still need from us users? What can we do to push this forward?
Comment 28 Chris Wilson 2016-02-20 16:03:59 UTC
*** Bug 94229 has been marked as a duplicate of this bug. ***
Comment 29 crow 2016-03-21 20:12:18 UTC
Created attachment 122465 [details]
Resetting chip after gpu hang

I have a Dell XPS 13 9350 (2016, Intel Core i7-6560U) running Fedora 23 (GNOME), currently on kernel 4.4.5-300.fc23.x86_64.

As soon as I open Chrome or Firefox and, say, open YouTube, the whole laptop hangs after a few seconds and becomes totally unresponsive. I need to hold the power button for a few seconds to power off the device.
Once I was able to get a log, which is attached.

I added "intel_iommu=igfx_off" to /etc/default/grub and regenerated the GRUB config (grub2-mkconfig -o /boot/efi/EFI/fedora/grub.cfg), but it does not solve the problem here.

If any more info is needed I would be glad to provide it. I hope I will be able to use my new notebook without issues.
Comment 30 crow 2016-03-21 20:26:39 UTC
Comment on attachment 122465 [details]
Resetting chip after gpu hang

>[Mon Mar 21 18:02:39 2016] i915 0000:00:02.0: Invalid ROM contents
>[Mon Mar 21 18:04:52 2016] nf_conntrack: automatic helper assignment is deprecated and it will be removed soon. Use the iptables CT target to attach helpers instead.
>[Mon Mar 21 18:05:54 2016] [drm] stuck on render ring
>[Mon Mar 21 18:05:54 2016] [drm] GPU HANG: ecode 9:0:0x85dfbfff, in chrome [3398], reason: Ring hung, action: reset
>[Mon Mar 21 18:05:54 2016] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
>[Mon Mar 21 18:05:54 2016] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
>[Mon Mar 21 18:05:54 2016] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
>[Mon Mar 21 18:05:54 2016] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
>[Mon Mar 21 18:05:54 2016] [drm] GPU crash dump saved to /sys/class/drm/card0/error
>[Mon Mar 21 18:05:54 2016] ------------[ cut here ]------------
>[Mon Mar 21 18:05:54 2016] WARNING: CPU: 1 PID: 423 at drivers/gpu/drm/i915/intel_display.c:11289 intel_mmio_flip_work_func+0x387/0x3d0 [i915]()
>[Mon Mar 21 18:05:54 2016] WARN_ON(__i915_wait_request(mmio_flip->req, mmio_flip->crtc->reset_counter, false, NULL, &mmio_flip->i915->rps.mmioflips))
>[Mon Mar 21 18:05:54 2016] Modules linked in:
>[Mon Mar 21 18:05:54 2016]  rfcomm fuse cmac xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_filter ebtable_nat ebtables ip6table_security ip6table_mangle ip6table_raw ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables iptable_security iptable_mangle iptable_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack bnep hid_multitouch snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal coretemp snd_soc_skl dell_led snd_soc_skl_ipc snd_hda_ext_core kvm_intel snd_soc_sst_ipc snd_soc_sst_dsp snd_soc_core kvm snd_hda_codec_realtek vfat snd_hda_codec_generic fat snd_compress snd_pcm_dmaengine ac97_bus
>[Mon Mar 21 18:05:54 2016]  dell_wmi sparse_keymap dw_dmac_core i2c_designware_platform dell_laptop i2c_designware_core snd_hda_intel irqbypass dcdbas brcmfmac snd_hda_codec brcmutil snd_hda_core snd_hwdep cfg80211 snd_seq snd_seq_device uvcvideo rtsx_pci_ms snd_pcm videobuf2_vmalloc memstick videobuf2_memops videobuf2_v4l2 videobuf2_core v4l2_common joydev snd_timer videodev snd mei_me btusb i2c_i801 soundcore btrtl idma64 media mei shpchp intel_lpss_pci hci_uart wmi btbcm btqca btintel bluetooth pinctrl_sunrisepoint pinctrl_intel rfkill intel_lpss_acpi intel_lpss processor_thermal_device int3403_thermal int340x_thermal_zone int3400_thermal intel_soc_dts_iosf acpi_als iosf_mbi acpi_thermal_rel kfifo_buf acpi_pad industrialio tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt i915 rtsx_pci_sdmmc mmc_core
>[Mon Mar 21 18:05:54 2016]  crct10dif_pclmul crc32_pclmul crc32c_intel i2c_algo_bit drm_kms_helper drm serio_raw nvme rtsx_pci i2c_hid video fjes
>[Mon Mar 21 18:05:54 2016] CPU: 1 PID: 423 Comm: kworker/1:3 Not tainted 4.4.5-300.fc23.x86_64 #1
>[Mon Mar 21 18:05:54 2016] Hardware name: Dell Inc. XPS 13 9350/XXXXX, BIOS 1.2.3 01/08/2016
>[Mon Mar 21 18:05:54 2016] Workqueue: events intel_mmio_flip_work_func [i915]
>[Mon Mar 21 18:05:54 2016]  0000000000000286 0000000093832e83 ffff880272cb7d20 ffffffff813b54ae
>[Mon Mar 21 18:05:54 2016]  ffff880272cb7d68 ffffffffa01f9de8 ffff880272cb7d58 ffffffff810a40f2
>[Mon Mar 21 18:05:54 2016]  ffff880215ebf0c0 ffff880280c96600 ffff880280c9b000 0000000000000040
>[Mon Mar 21 18:05:54 2016] Call Trace:
>[Mon Mar 21 18:05:54 2016]  [<ffffffff813b54ae>] dump_stack+0x63/0x85
>[Mon Mar 21 18:05:54 2016]  [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0
>[Mon Mar 21 18:05:54 2016]  [<ffffffff810a418c>] warn_slowpath_fmt+0x5c/0x80
>[Mon Mar 21 18:05:54 2016]  [<ffffffffa01937d7>] intel_mmio_flip_work_func+0x387/0x3d0 [i915]
>[Mon Mar 21 18:05:54 2016]  [<ffffffff810bc596>] process_one_work+0x156/0x430
>[Mon Mar 21 18:05:54 2016]  [<ffffffff810bc8be>] worker_thread+0x4e/0x450
>[Mon Mar 21 18:05:54 2016]  [<ffffffff8179bd55>] ? __schedule+0x3a5/0xa00
>[Mon Mar 21 18:05:54 2016]  [<ffffffff810bc870>] ? process_one_work+0x430/0x430
>[Mon Mar 21 18:05:54 2016]  [<ffffffff810bc870>] ? process_one_work+0x430/0x430
>[Mon Mar 21 18:05:54 2016]  [<ffffffff810c2648>] kthread+0xd8/0xf0
>[Mon Mar 21 18:05:54 2016]  [<ffffffff810c2570>] ? kthread_worker_fn+0x160/0x160
>[Mon Mar 21 18:05:54 2016]  [<ffffffff817a088f>] ret_from_fork+0x3f/0x70
>[Mon Mar 21 18:05:54 2016]  [<ffffffff810c2570>] ? kthread_worker_fn+0x160/0x160
>[Mon Mar 21 18:05:54 2016] ---[ end trace b5a5acfc195b296b ]---
>[Mon Mar 21 18:05:54 2016] drm/i915: Resetting chip after gpu hang
>[Mon Mar 21 18:05:56 2016] [drm] RC6 on


[Mon Mar 21 21:23:31 2016] [drm] stuck on render ring
[Mon Mar 21 21:23:31 2016] [drm] GPU HANG: ecode 9:0:0x85df9fff, in chrome [5196], reason: Ring hung, action: reset
[Mon Mar 21 21:23:31 2016] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[Mon Mar 21 21:23:31 2016] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[Mon Mar 21 21:23:31 2016] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[Mon Mar 21 21:23:31 2016] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[Mon Mar 21 21:23:31 2016] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[Mon Mar 21 21:23:31 2016] ------------[ cut here ]------------
[Mon Mar 21 21:23:31 2016] WARNING: CPU: 2 PID: 120 at drivers/gpu/drm/i915/intel_display.c:11289 intel_mmio_flip_work_func+0x387/0x3d0 [i915]()
[Mon Mar 21 21:23:31 2016] WARN_ON(__i915_wait_request(mmio_flip->req, mmio_flip->crtc->reset_counter, false, NULL, &mmio_flip->i915->rps.mmioflips))
[Mon Mar 21 21:23:31 2016] Modules linked in:
[Mon Mar 21 21:23:31 2016]  rfcomm fuse cmac xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_conntrack ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables ip6_tables iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle bnep hid_multitouch intel_rapl snd_soc_skl x86_pkg_temp_thermal snd_soc_skl_ipc coretemp snd_hda_ext_core snd_soc_sst_ipc snd_hda_codec_hdmi kvm_intel snd_soc_sst_dsp dell_led kvm snd_soc_core brcmfmac vfat snd_hda_codec_realtek fat snd_compress snd_hda_codec_generic snd_pcm_dmaengine ac97_bus dw_dmac_core brcmutil dell_wmi dell_laptop i2c_designware_platform snd_hda_intel i2c_designware_core sparse_keymap dcdbas cfg80211 irqbypass snd_hda_codec
[Mon Mar 21 21:23:31 2016]  snd_hda_core snd_hwdep snd_seq uvcvideo rtsx_pci_ms videobuf2_vmalloc memstick videobuf2_memops videobuf2_v4l2 snd_seq_device snd_pcm joydev btusb videobuf2_core btrtl v4l2_common videodev snd_timer snd mei_me media i2c_i801 soundcore mei hci_uart idma64 processor_thermal_device shpchp intel_lpss_pci intel_soc_dts_iosf iosf_mbi wmi btbcm btqca btintel bluetooth pinctrl_sunrisepoint pinctrl_intel rfkill intel_lpss_acpi int3403_thermal intel_lpss int340x_thermal_zone int3400_thermal acpi_thermal_rel acpi_pad acpi_als kfifo_buf industrialio tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt i915 rtsx_pci_sdmmc mmc_core i2c_algo_bit drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel drm nvme serio_raw rtsx_pci i2c_hid video fjes
[Mon Mar 21 21:23:31 2016] CPU: 2 PID: 120 Comm: kworker/2:1 Not tainted 4.4.5-300.fc23.x86_64 #1
[Mon Mar 21 21:23:31 2016] Hardware name: Dell Inc. XPS 13 9350/0JXC1H, BIOS 1.2.3 01/08/2016
[Mon Mar 21 21:23:31 2016] Workqueue: events intel_mmio_flip_work_func [i915]
[Mon Mar 21 21:23:31 2016]  0000000000000286 000000005e30db05 ffff880273f0fd20 ffffffff813b54ae
[Mon Mar 21 21:23:31 2016]  ffff880273f0fd68 ffffffffa01f4de8 ffff880273f0fd58 ffffffff810a40f2
[Mon Mar 21 21:23:31 2016]  ffff8802318a8140 ffff880280d16600 ffff880280d1b000 0000000000000080
[Mon Mar 21 21:23:31 2016] Call Trace:
[Mon Mar 21 21:23:31 2016]  [<ffffffff813b54ae>] dump_stack+0x63/0x85
[Mon Mar 21 21:23:31 2016]  [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0
[Mon Mar 21 21:23:31 2016]  [<ffffffff810a418c>] warn_slowpath_fmt+0x5c/0x80
[Mon Mar 21 21:23:31 2016]  [<ffffffffa018e7d7>] intel_mmio_flip_work_func+0x387/0x3d0 [i915]
[Mon Mar 21 21:23:31 2016]  [<ffffffff810bc596>] process_one_work+0x156/0x430
[Mon Mar 21 21:23:31 2016]  [<ffffffff810bc8be>] worker_thread+0x4e/0x450
[Mon Mar 21 21:23:31 2016]  [<ffffffff8179bd55>] ? __schedule+0x3a5/0xa00
[Mon Mar 21 21:23:31 2016]  [<ffffffff810bc870>] ? process_one_work+0x430/0x430
[Mon Mar 21 21:23:31 2016]  [<ffffffff810bc870>] ? process_one_work+0x430/0x430
[Mon Mar 21 21:23:31 2016]  [<ffffffff810c2648>] kthread+0xd8/0xf0
[Mon Mar 21 21:23:31 2016]  [<ffffffff810c2570>] ? kthread_worker_fn+0x160/0x160
[Mon Mar 21 21:23:31 2016]  [<ffffffff817a088f>] ret_from_fork+0x3f/0x70
[Mon Mar 21 21:23:31 2016]  [<ffffffff810c2570>] ? kthread_worker_fn+0x160/0x160
[Mon Mar 21 21:23:31 2016] ---[ end trace 2e1339ae2448560b ]---
[Mon Mar 21 21:23:31 2016] drm/i915: Resetting chip after gpu hang
[Mon Mar 21 21:23:33 2016] [drm] RC6 on
[Mon Mar 21 21:23:47 2016] [drm] stuck on render ring
[Mon Mar 21 21:23:47 2016] [drm] GPU HANG: ecode 9:0:0x85dfbfff, in chrome [5273], reason: Ring hung, action: reset
[Mon Mar 21 21:23:47 2016] ------------[ cut here ]------------
[Mon Mar 21 21:23:47 2016] WARNING: CPU: 2 PID: 120 at drivers/gpu/drm/i915/intel_display.c:11289 intel_mmio_flip_work_func+0x387/0x3d0 [i915]()
[Mon Mar 21 21:23:47 2016] WARN_ON(__i915_wait_request(mmio_flip->req, mmio_flip->crtc->reset_counter, false, NULL, &mmio_flip->i915->rps.mmioflips))
[Mon Mar 21 21:23:47 2016] Modules linked in:
[Mon Mar 21 21:23:47 2016]  rfcomm fuse cmac xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_netbios_ns nf_conntrack_broadcast xt_conntrack ip_set nfnetlink ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables ip6_tables iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle bnep hid_multitouch intel_rapl snd_soc_skl x86_pkg_temp_thermal snd_soc_skl_ipc coretemp snd_hda_ext_core snd_soc_sst_ipc snd_hda_codec_hdmi kvm_intel snd_soc_sst_dsp dell_led kvm snd_soc_core brcmfmac vfat snd_hda_codec_realtek fat snd_compress snd_hda_codec_generic snd_pcm_dmaengine ac97_bus dw_dmac_core brcmutil dell_wmi dell_laptop i2c_designware_platform snd_hda_intel i2c_designware_core sparse_keymap dcdbas cfg80211 irqbypass snd_hda_codec
[Mon Mar 21 21:23:47 2016]  snd_hda_core snd_hwdep snd_seq uvcvideo rtsx_pci_ms videobuf2_vmalloc memstick videobuf2_memops videobuf2_v4l2 snd_seq_device snd_pcm joydev btusb videobuf2_core btrtl v4l2_common videodev snd_timer snd mei_me media i2c_i801 soundcore mei hci_uart idma64 processor_thermal_device shpchp intel_lpss_pci intel_soc_dts_iosf iosf_mbi wmi btbcm btqca btintel bluetooth pinctrl_sunrisepoint pinctrl_intel rfkill intel_lpss_acpi int3403_thermal intel_lpss int340x_thermal_zone int3400_thermal acpi_thermal_rel acpi_pad acpi_als kfifo_buf industrialio tpm_tis tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt i915 rtsx_pci_sdmmc mmc_core i2c_algo_bit drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel drm nvme serio_raw rtsx_pci i2c_hid video fjes
[Mon Mar 21 21:23:47 2016] CPU: 2 PID: 120 Comm: kworker/2:1 Tainted: G        W       4.4.5-300.fc23.x86_64 #1
[Mon Mar 21 21:23:47 2016] Hardware name: Dell Inc. XPS 13 9350/0JXC1H, BIOS 1.2.3 01/08/2016
[Mon Mar 21 21:23:47 2016] Workqueue: events intel_mmio_flip_work_func [i915]
[Mon Mar 21 21:23:47 2016]  0000000000000286 000000005e30db05 ffff880273f0fd20 ffffffff813b54ae
[Mon Mar 21 21:23:47 2016]  ffff880273f0fd68 ffffffffa01f4de8 ffff880273f0fd58 ffffffff810a40f2
[Mon Mar 21 21:23:47 2016]  ffff880066669440 ffff880280d16600 ffff880280d1b000 0000000000000080
[Mon Mar 21 21:23:47 2016] Call Trace:
[Mon Mar 21 21:23:47 2016]  [<ffffffff813b54ae>] dump_stack+0x63/0x85
[Mon Mar 21 21:23:47 2016]  [<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0
[Mon Mar 21 21:23:47 2016]  [<ffffffff810a418c>] warn_slowpath_fmt+0x5c/0x80
[Mon Mar 21 21:23:47 2016]  [<ffffffffa018e7d7>] intel_mmio_flip_work_func+0x387/0x3d0 [i915]
[Mon Mar 21 21:23:47 2016]  [<ffffffff810bc596>] process_one_work+0x156/0x430
[Mon Mar 21 21:23:47 2016]  [<ffffffff810bc8be>] worker_thread+0x4e/0x450
[Mon Mar 21 21:23:47 2016]  [<ffffffff8179bd55>] ? __schedule+0x3a5/0xa00
[Mon Mar 21 21:23:47 2016]  [<ffffffff810bc870>] ? process_one_work+0x430/0x430
[Mon Mar 21 21:23:47 2016]  [<ffffffff810bc870>] ? process_one_work+0x430/0x430
[Mon Mar 21 21:23:47 2016]  [<ffffffff810c2648>] kthread+0xd8/0xf0
[Mon Mar 21 21:23:47 2016]  [<ffffffff810c2570>] ? kthread_worker_fn+0x160/0x160
[Mon Mar 21 21:23:47 2016]  [<ffffffff817a088f>] ret_from_fork+0x3f/0x70
[Mon Mar 21 21:23:47 2016]  [<ffffffff810c2570>] ? kthread_worker_fn+0x160/0x160
[Mon Mar 21 21:23:47 2016] ---[ end trace 2e1339ae2448560c ]---
[Mon Mar 21 21:23:47 2016] drm/i915: Resetting chip after gpu hang
[Mon Mar 21 21:23:49 2016] [drm] RC6 on
Comment 31 crow 2016-03-21 20:34:45 UTC
Can someone delete my last comment (and this one too)? I meant to add an attachment, not paste the whole call trace as a comment.
Comment 32 Sergi Barroso 2016-04-06 18:36:24 UTC
Created attachment 122775 [details]
Kernel log with i915 hang
Comment 33 Sergi Barroso 2016-04-06 18:37:19 UTC
Created attachment 122776 [details]
Error from intel card
Comment 34 Sergi Barroso 2016-04-06 18:43:01 UTC
I have a Dell Latitude e7250 with:

- Intel(R) Core(TM) i7-5600U CPU 
- Intel Corporation Broadwell-U Integrated Graphics (rev 09)

I see the same behavior here. I've tried disabling other framebuffer devices such as EFI, but so far the only thing that has worked for me is disabling the IOMMU in the i915 kernel module, which is not a clean approach.

I've attached my dumps and hope they help. I'll try Arch Linux anyway to see if it has the same problem.

Sergi
Comment 35 Chris Wilson 2016-04-16 05:45:47 UTC
*** Bug 94959 has been marked as a duplicate of this bug. ***
Comment 36 Vladimir Miloserdov 2016-06-02 17:41:47 UTC
This is still unfixed, more than a year after it was reported. It is critical: it makes the laptop unusable for ANY kind of video task, since with igfx_off it runs really slowly. Will such a critical bug finally be assigned to someone? It has been reported so many times and affects so many users... Or are there at least any thoughts on where to look in this damn DMAR code for this bug?
Comment 37 Harry 2016-08-26 07:56:24 UTC
I have this too, on an i5-5287U.

The IOMMU is disabled by default on Linux.
For better KVM virtualization, it can be turned on using the kernel command line parameter intel_iommu=on.

With this kernel parameter present, Linux freezes during boot.
Comment 38 Bobby Powers 2016-08-26 12:45:42 UTC
This is actually fixed for me in 4.7.  I have a Thinkpad X1 Carbon with an Intel i7-5600U, and haven't seen any issues when enabling the IOMMU over the last month or so (I don't remember if things also worked under 4.6, but I was running 4.7 RCs for a bit).

Hope that helps.
Comment 39 Julian Cromarty 2016-09-01 08:42:52 UTC
I built and tried 4.7.2 last night and enabling VT-d still causes the freezes to happen so I don't think it is fixed yet. Kernel log messages when the freezes happen contain the following when it was able to recover from the freeze:

[ 1936.694513] [drm] stuck on render ring
[ 1936.694899] [drm] GPU HANG: ecode 8:0:0x85dffffb, in X [3356], reason: Engine(s) hung, action: reset
[ 1936.696494] drm/i915: Resetting chip after gpu hang

And the last one that killed it, requiring the power button to be held to turn the machine off, had a different ecode:

[ 1944.706379] [drm] stuck on render ring
[ 1944.706694] [drm] GPU HANG: ecode 8:0:0xbf9fffff, reason: Engine(s) hung, action: reset
[ 1944.708378] drm/i915: Resetting chip after gpu hang
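
To compare ecodes across many reports like the two above, the "GPU HANG" lines can be pulled apart mechanically. This is an illustrative Python sketch based only on the line format visible in this thread; the names are hypothetical, and it treats the "in <process> [pid]" part as optional because the second hang above lacks it.

```python
import re
from typing import NamedTuple, Optional

# Pattern derived from the "GPU HANG" lines quoted in this bug.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>\d+:\d+:0x[0-9a-fA-F]+)"
    r"(?:, in (?P<comm>.+?) \[(?P<pid>\d+)\])?"
    r", reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


class GpuHang(NamedTuple):
    ecode: str            # e.g. "8:0:0x85dffffb", kept as printed
    comm: Optional[str]   # process name, if the kernel reported one
    pid: Optional[int]
    reason: str
    action: str


def parse_gpu_hang(line: str) -> Optional[GpuHang]:
    """Parse one dmesg line; return None if it is not a GPU HANG line."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    pid = m.group("pid")
    return GpuHang(
        ecode=m.group("ecode"),
        comm=m.group("comm"),
        pid=int(pid) if pid else None,
        reason=m.group("reason"),
        action=m.group("action"),
    )
```

Feeding it every line of a saved dmesg and collecting the non-None results gives a quick list of which processes and ecodes were involved.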
Comment 40 Michael Groh 2016-10-06 09:04:25 UTC
Created attachment 127048 [details]
drm_error crash dump of brots ThinkPad X250
Comment 41 Michael Groh 2016-10-06 09:06:33 UTC
Hello everyone,

First of all, sorry, I forgot to add a comment to the file.

I think I am also hitting this bug. The system is a Lenovo ThinkPad X250 with an Intel(R) Core(TM) i5-5200U and HD Graphics 5500.

Since I started using the IOMMU I have been getting hangs and also some X restarts. This is what dmesg says after the fact:

[ 1409.513438] DMAR: DRHD: handling fault status reg 3
[ 1409.513451] DMAR: [DMA Read] Request device [00:02.0] fault addr f5995000 [fault reason 05] PTE Write access is not set
[ 1409.513468] DMAR: DRHD: handling fault status reg 3
[ 1409.513471] DMAR: [DMA Write] Request device [00:02.0] fault addr f5968000 [fault reason 23] Unknown
[ 1418.830396] [drm] GPU HANG: ecode 8:0:0x85dffffb, in kwin_x11 [2191], reason: Hang on render ring, action: reset
[ 1418.830407] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1418.830409] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1418.830410] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1418.830412] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1418.830413] [drm] GPU crash dump saved to /sys/class/drm/card0/error

I will upload said crash dump here. If I can provide you with any more debug logs or should test something, I would be glad to help.

Thanks,
Michael
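
For comparing faulting addresses between reports (comment 20 notes that they do not look like DMA API addresses), the DMAR lines in the dmesg above can be extracted with a short script. This is a sketch only, written against the line format shown in this comment; the names are hypothetical, the address is assumed to be hexadecimal, and the reason code is kept as a string so no radix is assumed.

```python
import re
from typing import List, NamedTuple

# Pattern derived from the DMAR fault lines quoted in this comment.
DMAR_FAULT_RE = re.compile(
    r"DMAR: \[DMA (?P<access>Read|Write)\] "
    r"Request device \[(?P<device>[0-9a-fA-F:.]+)\] "
    r"fault addr (?P<addr>[0-9a-fA-F]+) "
    r"\[fault reason (?P<reason>\w+)\] (?P<text>.*)"
)


class DmarFault(NamedTuple):
    access: str   # "Read" or "Write"
    device: str   # PCI bus:device.function, e.g. "00:02.0"
    addr: int     # faulting DMA address (parsed as hex)
    reason: str   # reason code exactly as printed
    text: str     # kernel's description of the reason


def parse_dmar_faults(log: str) -> List[DmarFault]:
    """Return every DMAR fault found in a block of dmesg text."""
    faults = []
    for line in log.splitlines():
        m = DMAR_FAULT_RE.search(line)
        if m:
            faults.append(
                DmarFault(
                    m["access"],
                    m["device"],
                    int(m["addr"], 16),
                    m["reason"],
                    m["text"].strip(),
                )
            )
    return faults
```

Printing hex(f.addr) for each fault on device 00:02.0 makes it easy to see whether the addresses cluster in one range.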
Comment 42 yann 2016-10-19 15:42:47 UTC
*** Bug 98309 has been marked as a duplicate of this bug. ***
Comment 43 Chris Wilson 2016-10-19 16:49:20 UTC
(In reply to yann from comment #42)
> *** Bug 98309 has been marked as a duplicate of this bug. ***

I would be careful not to mix gen8/gen9 reports for the moment, not until we have the root cause.
Comment 44 klondike 2016-10-22 13:21:26 UTC
(In reply to Chris Wilson from comment #43)
> (In reply to yann from comment #42)
> > *** Bug 98309 has been marked as a duplicate of this bug. ***
> 
> I would be careful not to mix gen8/gen9 reports for the moment, not until we
> have the root cause.

Well I pointed out early that one of the causes seems to be a conflict between the UEFI framebuffer driver and the intel one, most likely because of some race conditions or both trying to access the same hardware at the same time.

For me disabling the EFI framebuffer solved the issue so far so maybe other reporters may want to test and see if that solves the issue for them too.
Comment 45 Mads 2016-10-23 15:26:25 UTC
(In reply to klondike from comment #44)
> 
> For me disabling the EFI framebuffer solved the issue so far so maybe other
> reporters may want to test and see if that solves the issue for them too.

I've neither had efifb nor fbsimple enabled on my XPS 15 9550, but I didn't get rid of this problem until I added "intel_iommu=igfx_off" to my bootargs.
Comment 46 klondike 2016-10-23 15:30:47 UTC
(In reply to Mads from comment #45)
> (In reply to klondike from comment #44)
> > 
> > For me disabling the EFI framebuffer solved the issue so far so maybe other
> > reporters may want to test and see if that solves the issue for them too.
> 
> I've neither had efifb nor fbsimple enabled on my XPS 15 9550, but I didn't
> get rid of this problem until I added "intel_iommu=igfx_off" to my bootargs.

Interesting, it seems that there is more than one distinct instance of this bug then. Do you have any other FB or driver that interacts with the Intel card other than Intel's modesetting one? The VGA console could be one such driver.
Comment 47 Mads 2016-10-23 15:46:24 UTC
(In reply to Mads from comment #45)
> (In reply to klondike from comment #44)
> > 
> > For me disabling the EFI framebuffer solved the issue so far so maybe other
> > reporters may want to test and see if that solves the issue for them too.
> 
> I've neither had efifb nor fbsimple enabled on my XPS 15 9550, but I didn't
> get rid of this problem until I added "intel_iommu=igfx_off" to my bootargs.

I failed to mention that I didn't come across this bug until trying some drm-intel-nightly based on 4.9.0, but efifb was never involved.

intel_iommu=igfx_off also solved a long standing bug with suspend on my machine, bug 97211
Comment 48 Mads 2016-10-23 16:26:36 UTC
VGA_CONSOLE is built in, but so is DRM_I915, with firmware, so this appears in dmesg:

...
[    1.051229] [drm] Replacing VGA console driver
...

But, I don't know if what I'm experiencing is relevant here, since I didn't see this bug appear until just recently with drm-intel-nightly (and fixed again by using intel_iommu=igfx_off)
Comment 49 klondike 2016-10-23 16:32:07 UTC
I guess we both are experiencing different bugs with similar symptoms.

In my case at least, the UEFI display driver clashes with the Intel one resulting in the IOMMU violations. This seems to be some kind of firmware bug where the firmware isn't playing along with the MMU settings and Intel's driver.

In yours, the cause may come from somewhere else. Hopefully the devs can provide more guidance on what is triggering your case, but if sleep is involved, chances are that firmware is somehow part of the issue.
Comment 50 Chris Wilson 2016-11-15 08:05:22 UTC
*** Bug 98728 has been marked as a duplicate of this bug. ***
Comment 51 Chris Wilson 2017-01-07 09:32:01 UTC
*** Bug 99308 has been marked as a duplicate of this bug. ***
Comment 52 Chris Wilson 2017-03-09 13:04:33 UTC
*** Bug 94780 has been marked as a duplicate of this bug. ***
Comment 53 Chris Wilson 2017-03-09 13:04:57 UTC
*** Bug 99964 has been marked as a duplicate of this bug. ***
Comment 54 Ernest Hurtado 2017-04-03 20:53:55 UTC
Created attachment 130657 [details]
/sys/class/drm/card0/error Intel HD Graphics 5500

Hi! I got bitten by this bug recently after enabling the IOMMU with intel_iommu=on on the kernel command line. It has happened to me once (so far): soon after resuming from suspend, my GPU hung.

I saw someone recommend disabling the EFI framebuffer. How exactly do you do this? Here is an excerpt from my dmesg:

dmesg | grep VGA
fb0: EFI VGA frame buffer device
fb: switching to inteldrmfb from EFI VGA
[drm] Replacing VGA console driver


DMESG log during hang:

DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr faf16000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf53000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf4f000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr faf56000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf60000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf5b000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf18000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf1a000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf1c000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 2
DMAR: [DMA Read] Request device [00:02.0] fault addr faf2b000 [fault reason 05] PTE Write access is not set
[drm] GPU HANG: ecode 8:0:0x85dffffb, in Xorg [533], reason: Hang on render ring, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
drm/i915: Resetting chip after gpu hang
dmar_fault: 235 callbacks suppressed
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff267000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff283000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff251000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff291000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2af000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2bf000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2f0000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2bc000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2dc000 [fault reason 05] PTE Write access is not set
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr ff2ed000 [fault reason 05] PTE Write access is not set
drm/i915: Resetting chip after gpu hang
[drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
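A minimal sketch of how the DMAR fault lines in a log like the one above could be tallied when comparing reports. The regular expression is an assumption inferred only from the line format quoted in this bug, not from any log format specification, and other kernel versions may print these lines differently. The function names are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the dmesg lines quoted in this bug (an assumption,
# not an official format). It tolerates an optional timestamp prefix
# because it is applied with search() rather than match().
FAULT_RE = re.compile(
    r"DMAR: \[DMA (?P<dir>Read|Write)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+) \[fault reason (?P<reason>\d+)\] (?P<text>.*)"
)


def parse_faults(log_text):
    """Return one dict per DMAR fault line found in log_text."""
    faults = []
    for line in log_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            fault = m.groupdict()
            fault["addr"] = int(fault["addr"], 16)   # address is printed in hex
            fault["reason"] = int(fault["reason"])   # "05" -> 5
            faults.append(fault)
    return faults


def summarize(faults):
    """Count faults per (device, direction, reason) triple."""
    return Counter((f["dev"], f["dir"], f["reason"]) for f in faults)


# Two fault lines copied from comments in this bug.
sample = """\
DMAR: DRHD: handling fault status reg 3
DMAR: [DMA Read] Request device [00:02.0] fault addr faf16000 [fault reason 05] PTE Write access is not set
[  123.917765] DMAR: [DMA Write] Request device [00:02.0] fault addr 108a000 [fault reason 23] Unknown
"""

for key, count in summarize(parse_faults(sample)).items():
    print(key, count)
```

Grouping by reason code makes it easier to see whether two reports show the same kind of fault (for example reason 05 versus reason 23) before treating them as the same bug.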
Comment 55 Ernest Hurtado 2017-04-11 11:29:15 UTC
Disabling the EFI framebuffer in the kernel didn't help; 'intel_iommu=on,igfx_off' did.
Comment 56 Ricardo 2017-05-09 17:33:10 UTC
Adding the ReadyForDev tag to the "Whiteboard" field.
The bug is still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included
Comment 57 Chris Wilson 2017-07-06 11:38:22 UTC
*** Bug 101694 has been marked as a duplicate of this bug. ***
Comment 58 Du, Changbin 2017-07-12 11:42:33 UTC
Chris, I did an experiment that turns intel_unmap into a no-op. The old DMA region then stays mapped, and a newly allocated buffer always gets a new IOVA. With that change I could not reproduce this issue.

Therefore, I guess the BDW GPU may still issue DMA transactions to an old, unmapped DMA region even after the workload has finished. I always see the hang at PIPE_CTL.

--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -3787,7 +3787,7 @@ static void intel_unmap(struct device *dev, dma_addr_t dev_addr, size_t size)
 	struct intel_iommu *iommu;
 	struct page *freelist;
 
-	if (iommu_no_mapping(dev))
+	//if (iommu_no_mapping(dev))
 		return;
 
 	domain = find_domain(dev);

BTW, sometimes the IOMMU fault address is 48 bits wide, which looks more like a Gfx VA, but sometimes it is 33 bits or less, which looks more like an IOVA or physical address. As I understand it, an IOMMU fault address should always be an IOVA, so how does a Gfx VA get recorded? Is the dedicated gfx IOMMU special?

Thanks,
Changbin.
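A minimal sketch of the width check described above: it reports how many significant bits a logged fault address has. The helper name is hypothetical, and the sample addresses are copied from logs posted in various comments of this bug, so they are only illustrative.

```python
def addr_bits(hex_addr):
    """Number of significant bits in a hexadecimal fault address string."""
    return int(hex_addr, 16).bit_length()


# Fault addresses copied from dmesg excerpts posted in this bug.
for addr in ("faf16000", "108a000", "32ff4b000", "9ade03000"):
    print(addr, addr_bits(addr))
```

The width alone cannot say whether an address is a Gfx VA, an IOVA, or a physical address; it only helps sort logged faults into the groups mentioned in the comment.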
Comment 59 Du, Changbin 2017-07-18 08:43:20 UTC
(In reply to Du, Changbin from comment #58)
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3787,7 +3787,7 @@ static void intel_unmap(struct device *dev, dma_addr_t
> dev_addr, size_t size)
>  	struct intel_iommu *iommu;
>  	struct page *freelist;
>  
> -	if (iommu_no_mapping(dev))
> +	//if (iommu_no_mapping(dev))
>  		return;
>

Our QA helped verify on both NUC and server platforms, and the result is interesting:
1. On the server with Iris Pro 6200 (Broadwell GT3e), we can still reproduce it with the above change.
2. On the NUC with HD Graphics 5500 (Broadwell GT2), we didn't see the DMAR error with the above change.

GT3e has 128MB eDRAM, while GT2 doesn't. So there is probably a problem in the gfx cache or memory.
Comment 60 Chris Wilson 2017-07-26 10:50:51 UTC
*** Bug 101785 has been marked as a duplicate of this bug. ***
Comment 61 Elizabeth 2017-07-31 20:01:59 UTC
*** Bug 100203 has been marked as a duplicate of this bug. ***
Comment 62 Elizabeth 2017-07-31 20:04:16 UTC
(In reply to Elizabeth from comment #61)
> *** Bug 100203 has been marked as a duplicate of this bug. ***
From bug 100203:
(In reply to stsp from comment #3)
> There are a few similar reports in this bugzilla,
> but please note that my dmesg ends with Oops. So
> probably it gives more info.
Comment 63 Chris Wilson 2017-08-24 16:24:03 UTC
*** Bug 101236 has been marked as a duplicate of this bug. ***
Comment 64 Chris Wilson 2017-08-24 16:24:58 UTC
*** Bug 101238 has been marked as a duplicate of this bug. ***
Comment 65 Elizabeth 2017-08-31 19:52:13 UTC
*** Bug 100209 has been marked as a duplicate of this bug. ***
Comment 66 Damjan Georgievski 2017-09-22 22:24:29 UTC
Happens to me too when booting 4.13.3 (also .2).

Hardware is
Thinkpad T450s, latest bios updated
DMI: LENOVO 20BWS10N00/20BWS10N00, BIOS JBET65WW (1.29 ) 06/15/2017
i7-5600U (Broadwell)

Booting 4.13.3-1-ARCH hangs when starting Xorg (sddm), the message I caught via ssh in the first second before it was hung completely was:
[  123.917760] DMAR: DRHD: handling fault status reg 2
[  123.917765] DMAR: [DMA Write] Request device [00:02.0] fault addr 108a000 [fault reason 23] Unknown


4.12.13-1-ARCH works fine
4.9.51-1-lts works fine too

Distro is Archlinux.
Booting in CLI works fine too.
Booting with "intel_iommu=on,igfx_off" on the kernel command line also works fine.

$ zgrep IOMMU /proc/config.gz 
CONFIG_GART_IOMMU=y
CONFIG_CALGARY_IOMMU=y
CONFIG_CALGARY_IOMMU_ENABLED_BY_DEFAULT=y
CONFIG_IOMMU_HELPER=y
CONFIG_VFIO_IOMMU_TYPE1=m
# CONFIG_VFIO_NOIOMMU is not set
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
# Generic IOMMU Pagetable Support
CONFIG_IOMMU_IOVA=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_V2=m
CONFIG_INTEL_IOMMU=y
CONFIG_INTEL_IOMMU_SVM=y
CONFIG_INTEL_IOMMU_DEFAULT_ON=y
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set


(ps. by mistake I added this comment to a closed/duplicate bug of this one. sorry about that)
Comment 67 Ernest Hurtado 2017-09-23 12:40:22 UTC
When I set the i915.enable_execlists=0 kernel option, it doesn't hang after resuming from suspend. It spams the log with these messages instead:

[drm:i915_gem_idle_work_handler [i915]] *ERROR* Timeout waiting for engines to idle
Comment 68 Ernest Hurtado 2017-09-23 14:25:12 UTC
Patching kernel with workaround posted in https://bugs.freedesktop.org/show_bug.cgi?id=89360#c58 fixes this issue for Intel HD Graphics 5500 (Broadwell GT2) and linux 4.13.3.
Comment 69 François Guerraz 2017-09-29 09:23:52 UTC
This bug has affected many people on Arch, on a variety of devices, since a maintainer switched on CONFIG_INTEL_IOMMU_DEFAULT_ON in kernel 4.13.x.
See the bug report: https://bugs.archlinux.org/task/55629
Comment 70 Ernest Hurtado 2017-10-01 09:18:23 UTC
(In reply to Ernest Hurtado from comment #68)
> Patching kernel with workaround posted in
> https://bugs.freedesktop.org/show_bug.cgi?id=89360#c58 fixes this issue for
> Intel HD Graphics 5500 (Broadwell GT2) and linux 4.13.3.

It fixed the GPU issue but caused CPU leaks with the network stack.
Comment 71 Ansgar Hegerfeld 2017-10-08 12:55:59 UTC
Created attachment 134746 [details]
dmesg caught after GPU hang/reset with 4.13.3
Comment 72 Ansgar Hegerfeld 2017-10-08 12:56:45 UTC
Created attachment 134747 [details]
DRM error dump caught after GPU hang/reset with 4.13.3
Comment 73 Chris Wilson 2018-01-26 12:22:51 UTC
*** Bug 104802 has been marked as a duplicate of this bug. ***
Comment 74 Chris Wilson 2018-02-03 19:10:39 UTC
*** Bug 104929 has been marked as a duplicate of this bug. ***
Comment 75 Jani Saarinen 2018-03-29 07:11:50 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether the bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug so that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you do that too, to see whether the issue is still valid there? If you cannot see the issue there, please comment on the bug.
Comment 76 prochazka.nicolas 2018-03-29 07:29:09 UTC
(In reply to Jani Saarinen from comment #75)
> First of all. Sorry about spam.
> This is mass update for our bugs. 
> 
> Sorry if you feel this annoying but with this trying to understand if bug
> still valid or not.
> If bug investigation still in progress, please ignore this and I apologize!
> 
> If you think this is not anymore valid, please comment to the bug that can
> be closed.
> If you haven't tested with our latest pre-upstream tree(drm-tip), can you do
> that also to see if issue is valid there still and if you cannot see issue
> there, please comment to the bug.

not for me
regards
Comment 77 Jani Saarinen 2018-03-29 07:31:16 UTC
Do you mean you are not seeing the issue anymore?
Comment 78 prochazka.nicolas 2018-03-29 08:23:35 UTC
Sorry,
we have been using intel_iommu=igfx_off on all our configurations since this bug.
Regards
Comment 79 Chris Wilson 2018-03-30 21:28:06 UTC
*** Bug 105823 has been marked as a duplicate of this bug. ***
Comment 80 Giovanni Grieco 2018-04-08 12:55:25 UTC
Created attachment 138682 [details]
crash dump of Intel Iris Graphics 6100

Description of problem:
GPU hang when launching "systemctl start graphical.target" with the kernel flag "intel_iommu=on".

Version-Release number of selected component (if applicable):
xorg-x11-drv-intel-2.99.917-31.20171025.fc27.x86_64
gnome-shell-3.26.2-4.fc27.x86_64
kernel-4.15.14-300.fc27.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Boot with kernel option 'intel_iommu=on'. 
2. GPU will hang at the start of GDM login screen.

Actual results:
dmesg:
[  354.039739] [drm] GPU HANG: ecode 8:-1:0x00000000, reason: Kicking stuck wait on rcs0, action: continue
[  354.039740] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  354.039740] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  354.039741] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  354.039741] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  354.039742] [drm] GPU crash dump saved to /sys/class/drm/card0/error

Expected results:
GPU does not hang

Additional info:
Hardware: Apple MacBook Pro Early 2015 13-inch
GPU: Intel Iris Graphics 6100
Distro: Fedora Workstation 27 x86_64
Comment 81 Yves-Alexis 2018-04-09 19:02:45 UTC
(In reply to Jani Saarinen from comment #75)
> If you haven't tested with our latest pre-upstream tree(drm-tip), can you do
> that also to see if issue is valid there still and if you cannot see issue
> there, please comment to the bug.

I haven't tested on drm-tip but the problem still happens on 4.15 with HD 5500
Comment 82 Jani Saarinen 2018-05-04 07:52:42 UTC
Chris, do you think testing drm-tip would help here?
Comment 83 Jani Saarinen 2018-05-17 08:08:54 UTC
Reporter, can you test with the latest drm-tip, which is now on 4.17.0-rc5?
Comment 84 Yves-Alexis 2018-05-17 18:02:11 UTC
(In reply to Jani Saarinen from comment #83)
> Reporter, can you test with latest drm-tip that is now on 4.17.0-rc5

I just tried and after a while (system docked with external screen configured, and some vtswitch because lightdm) I had a freeze. Looking at kern.log I have:

May 17 18:01:22 scapa kernel: [14619.773523] [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe C FIFO underrun
May 17 18:01:25 scapa kernel: [14622.309074] [drm:intel_cpu_fifo_underrun_irq_handler] *ERROR* CPU pipe B FIFO underrun
Comment 85 Jani Saarinen 2018-05-17 18:05:33 UTC
Please attach the whole log too: boot with drm.debug=0x1e log_buf_len=4M and send the resulting dmesg.
Comment 86 Yves-Alexis 2018-05-17 18:09:24 UTC
(In reply to Jani Saarinen from comment #85)
> Please attach whole log too so send dmesg with drm.debug=0x1e log_buf_len=4M?

Unfortunately it's not really reproducible at will, and I assume booting with those options will render the system really slow and barely usable?
Comment 87 stsp2 2018-05-17 22:03:33 UTC
(In reply to Yves-Alexis from comment #86)
> Unfortunately it's not really reproducible at will,

Then I wonder why the bugs like
https://bugs.freedesktop.org/show_bug.cgi?id=100203
https://bugs.freedesktop.org/show_bug.cgi?id=94959
and all the other 100%-reproducible bugs were
marked as a duplicate of this one... :(
Comment 88 Yves-Alexis 2018-05-18 19:45:02 UTC
(In reply to stsp from comment #87)
> (In reply to Yves-Alexis from comment #86)
> > Unfortunately it's not really reproducible at will,
> 
> Then I wonder why the bugs like
> https://bugs.freedesktop.org/show_bug.cgi?id=100203
> https://bugs.freedesktop.org/show_bug.cgi?id=94959
> and all the other 100%-reproducible bugs were
> marked as a duplicate of this one... :(

I can't really comment on the other bugs. For me, I can't trigger it right away, but for 3 years now, every time I try to remove igfx_off the system ends up freezing after a while. I try to test the latest branches and provide logs, but honestly nothing has really changed since 2015.
Comment 89 François Guerraz 2018-05-25 08:22:11 UTC
On the latest drm-tip with an i7-6560U, the problem only occurs if the GuC is enabled.
Comment 90 Ernest Hurtado 2018-06-26 17:31:43 UTC
After upgrading to 4.18rc I can't reproduce the crash anymore.
Comment 91 Francesco Balestrieri 2018-07-02 10:44:30 UTC
Marking resolved based on the last two comments. Before reopening, please make sure you can reproduce with the latest drm-tip.
Comment 92 James Ausmus 2018-07-18 01:48:32 UTC
Closing, as the latest feedback, from two months ago and three weeks ago, is that this is now working.
Comment 93 Yves-Alexis 2018-08-29 17:17:07 UTC
I just had a chance to test 4.18 and I have to say it's *not* fixed. Maybe it's a different bug, but in any case I had a “soft” freeze with the following messages in dmesg:

Aug 29 19:04:17 scapa kernel: [   26.943249] DMAR: DRHD: handling fault status reg 3
Aug 29 19:04:17 scapa kernel: [   26.943255] DMAR: [DMA Read] Request device [00:02.0] fault addr 4600000 [fault reason 23] Unknown
Aug 29 19:04:17 scapa kernel: [   26.943259] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:17 scapa kernel: [   26.943262] DMAR: [DMA Read] Request device [00:02.0] fault addr 4613000 [fault reason 23] Unknown
Aug 29 19:04:17 scapa kernel: [   26.943264] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:17 scapa kernel: [   26.943267] DMAR: [DMA Read] Request device [00:02.0] fault addr 461b000 [fault reason 23] Unknown
Aug 29 19:04:17 scapa kernel: [   26.943269] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:24 scapa kernel: [   33.831279] [drm] GPU HANG: ecode 8:0:0x85dffffb, in Xorg [1028], reason: hang on rcs0, action: reset
Aug 29 19:04:24 scapa kernel: [   33.831280] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Aug 29 19:04:24 scapa kernel: [   33.831281] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Aug 29 19:04:24 scapa kernel: [   33.831281] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Aug 29 19:04:24 scapa kernel: [   33.831282] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Aug 29 19:04:24 scapa kernel: [   33.831282] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Aug 29 19:04:24 scapa kernel: [   33.831298] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Aug 29 19:04:24 scapa kernel: [   33.838481] dmar_fault: 53 callbacks suppressed
Aug 29 19:04:24 scapa kernel: [   33.838482] DMAR: DRHD: handling fault status reg 3
Aug 29 19:04:24 scapa kernel: [   33.838487] DMAR: [DMA Write] Request device [00:02.0] fault addr 4641000 [fault reason 23] Unknown
Aug 29 19:04:32 scapa kernel: [   41.824158] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Aug 29 19:04:32 scapa kernel: [   41.824723] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:32 scapa kernel: [   41.824729] DMAR: [DMA Write] Request device [00:02.0] fault addr 17f000 [fault reason 23] Unknown
Aug 29 19:04:40 scapa kernel: [   49.813478] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Aug 29 19:04:48 scapa kernel: [   57.804899] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Aug 29 19:04:56 scapa kernel: [   65.799728] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Aug 29 19:04:56 scapa kernel: [   65.882208] wlan0: deauthenticating from 14:0c:76:bf:71:fc by local choice (Reason: 3=DEAUTH_LEAVING)
Aug 29 19:04:56 scapa kernel: [   65.902446] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
Aug 29 19:04:57 scapa kernel: [   66.510770] DMAR: DRHD: handling fault status reg 3
Aug 29 19:04:57 scapa kernel: [   66.510778] DMAR: [DMA Write] Request device [00:02.0] fault addr fffc6000 [fault reason 23] Unknown
Aug 29 19:04:57 scapa kernel: [   66.510781] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:57 scapa kernel: [   66.510784] DMAR: [DMA Write] Request device [00:02.0] fault addr 4d000 [fault reason 23] Unknown
Aug 29 19:04:57 scapa kernel: [   66.510788] DMAR: DRHD: handling fault status reg 2
Aug 29 19:04:57 scapa kernel: [   66.510791] DMAR: [DMA Write] Request device [00:02.0] fault addr 51000 [fault reason 23] Unknown
Aug 29 19:04:57 scapa kernel: [   66.510802] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:02 scapa kernel: [   71.509221] dmar_fault: 6733586 callbacks suppressed
Aug 29 19:05:02 scapa kernel: [   71.509222] DMAR: DRHD: handling fault status reg 3
Aug 29 19:05:02 scapa kernel: [   71.509230] DMAR: [DMA Write] Request device [00:02.0] fault addr 32ff4b000 [fault reason 23] Unknown
Aug 29 19:05:02 scapa kernel: [   71.509233] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:02 scapa kernel: [   71.509236] DMAR: [DMA Write] Request device [00:02.0] fault addr 32ff53000 [fault reason 23] Unknown
Aug 29 19:05:02 scapa kernel: [   71.509239] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:02 scapa kernel: [   71.509241] DMAR: [DMA Write] Request device [00:02.0] fault addr 32ff57000 [fault reason 23] Unknown
Aug 29 19:05:02 scapa kernel: [   71.509244] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:07 scapa kernel: [   76.511769] dmar_fault: 6751341 callbacks suppressed
Aug 29 19:05:07 scapa kernel: [   76.511770] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:07 scapa kernel: [   76.511775] DMAR: [DMA Write] Request device [00:02.0] fault addr 66e6bc000 [fault reason 23] Unknown
Aug 29 19:05:07 scapa kernel: [   76.511778] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:07 scapa kernel: [   76.511781] DMAR: [DMA Write] Request device [00:02.0] fault addr 66e6c3000 [fault reason 23] Unknown
Aug 29 19:05:07 scapa kernel: [   76.511784] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:07 scapa kernel: [   76.511787] DMAR: [DMA Write] Request device [00:02.0] fault addr 66e6c8000 [fault reason 23] Unknown
Aug 29 19:05:07 scapa kernel: [   76.511790] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:12 scapa kernel: [   81.514717] dmar_fault: 6802731 callbacks suppressed
Aug 29 19:05:12 scapa kernel: [   81.514718] DMAR: DRHD: handling fault status reg 3
Aug 29 19:05:12 scapa kernel: [   81.514722] DMAR: [DMA Write] Request device [00:02.0] fault addr 9ade03000 [fault reason 23] Unknown
Aug 29 19:05:12 scapa kernel: [   81.514725] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:12 scapa kernel: [   81.514728] DMAR: [DMA Write] Request device [00:02.0] fault addr 9ade0a000 [fault reason 23] Unknown
Aug 29 19:05:12 scapa kernel: [   81.514731] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:12 scapa kernel: [   81.514733] DMAR: [DMA Write] Request device [00:02.0] fault addr 9ade0e000 [fault reason 23] Unknown
Aug 29 19:05:12 scapa kernel: [   81.514736] DMAR: DRHD: handling fault status reg 2
Aug 29 19:05:12 scapa kernel: [   81.794708] i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0
Aug 29 19:05:20 scapa kernel: [   89.793873] i915 0000:00:02.0: Resetting chip for hang on rcs0
Aug 29 19:05:20 scapa kernel: [   89.793938] i915 0000:00:02.0: GPU recovery failed

Unfortunately because of the soft freeze I didn't have a chance to recover /sys/class/drm/card0/error. But the hang happened pretty soon after boot on my broadwell CPU, pretty much as soon as I enabled the external screen when logged on the desktop.
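A minimal sketch of how the scale of the fault storm in a log like the one above could be estimated. The "callbacks suppressed" lines come from kernel log rate limiting, so the fault lines that are actually printed are only a small sample. The regular expression is an assumption based on the lines quoted in this comment, and the function name is hypothetical.

```python
import re

# Matches the rate-limit notices seen in the log above.
SUPPRESSED_RE = re.compile(r"dmar_fault: (\d+) callbacks suppressed")


def suppressed_total(log_text):
    """Sum the fault reports that the kernel dropped from the log."""
    return sum(int(n) for n in SUPPRESSED_RE.findall(log_text))


# The four rate-limit notices from the log in this comment.
sample = """\
[   33.838481] dmar_fault: 53 callbacks suppressed
[   71.509221] dmar_fault: 6733586 callbacks suppressed
[   76.511769] dmar_fault: 6751341 callbacks suppressed
[   81.514717] dmar_fault: 6802731 callbacks suppressed
"""

print(suppressed_total(sample))  # prints 20287711
```

By this count, roughly twenty million fault reports were suppressed within about a minute of uptime, on top of the lines that were printed.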
Comment 94 bordjukov 2018-08-29 18:28:25 UTC
Hello,

I can also confirm that I am getting this on 4.18.5. I reproduce it somewhat consistently when I suspend my laptop and then resume it.
Comment 95 Yves-Alexis 2018-08-30 09:55:19 UTC
I managed to reproduce again (same MO), and this time I managed to get /sys/class/drm/card0/error. Before I post it here, can someone confirm that it doesn't contain any personal information? There is some binary data in there, so I'd prefer some confirmation.
Comment 96 Lakshmi 2018-09-10 09:02:47 UTC
The workaround for GPU hangs that come with a DMAR error is to add the intel_iommu=igfx_off kernel option.

If the GPU hangs for some other reason, please file a new bug if there isn't an existing one.

Closing this bug, as the original issue is resolved with a workaround fix.
Comment 97 Lakshmi 2018-09-10 09:04:52 UTC
Create a new bug for GPU hangs (other than the DMAR error case).

When you create it, ensure that the issue occurs with the latest drm-tip (https://cgit.freedesktop.org/drm-tip).

Attach the full dmesg from boot with the kernel parameters drm.debug=0x1e log_buf_len=4M.
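A minimal sketch of what the requested drm.debug=0x1e value selects. The bit-to-category table is an assumption based on the DRM debug categories as commonly documented for kernels of this era (CORE, DRIVER, KMS, PRIME, ATOMIC, VBL); check it against the kernel source for the version in use. The function name is hypothetical.

```python
# Assumed mapping of drm.debug bits to message categories; verify against
# the kernel version being tested.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(value):
    """List the category names whose bit is set in a drm.debug value."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if value & bit]


print(decode_drm_debug(0x1E))  # prints ['DRIVER', 'KMS', 'PRIME', 'ATOMIC']
```

Under this mapping, 0x1e leaves out the CORE and VBL categories, which tend to be among the noisiest, while log_buf_len=4M enlarges the kernel ring buffer so that early boot messages are not overwritten.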
Comment 98 Yves-Alexis 2018-09-10 09:09:47 UTC
(In reply to Lakshmi from comment #96)
> Workaround for gpu hangs that comes with DMAR ERROR is to add
> intel_iommu=igfx_off kernel option.

Hi Lakshmi,

We have known about igfx_off for a long time, but it's not a fix, it's a workaround. We were told the bug was *fixed* in recent kernels (4.18+), but that doesn't seem to be the case, at least on Broadwell.

> Closing this bug, as the original issue is resolved with a work around fix.

I have to admit I'm disappointed by this. Not surprised though; I was kind of expecting it. It just took quite a lot of time to finally admit there would be no software fix.

I was told the same for my i7-640LM. Is there a chance DMAR will one day work fine with the iGPU, or should we enforce igfx_off by default in the Linux kernel?
Comment 99 bordjukov 2018-09-10 09:45:16 UTC
(In reply to Lakshmi from comment #96)
> Workaround for gpu hangs that comes with DMAR ERROR is to add
> intel_iommu=igfx_off kernel option.

Hello Lakshmi,

While use of intel_iommu=igfx_off works around the GPU freeze issues, it has major drawbacks. For example it makes Intel VT-d _unusable_ with VirtualBox.

Also, stating that a workaround is a fix for a bug is a contradiction IMO. I still consider that this bug is present and appeal to either keep the issue open until it has been fixed and the fix has been validated, or state outright that there will be no fix and VT-d should be considered buggy on Broadwell.
Comment 100 stsp2 2018-09-10 10:44:02 UTC
(In reply to Lakshmi from comment #96)
> Workaround for gpu hangs that comes with DMAR ERROR is to add
> intel_iommu=igfx_off kernel option.
> 
> IF gpu hangs for some other reason please file a new bug if there isn't a
> existing bug.
> 
> Closing this bug, as the original issue is resolved with a work around fix.

Unbelievable to hear such a thing from an Intel employee.
Please refer to the docs:
https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt
```
Graphics Problems?
------------------
If you encounter issues with graphics devices, you can try adding
option intel_iommu=igfx_off to turn off the integrated graphics engine.
If this fixes anything, please ensure you file a bug reporting the problem.
```

"to turn off the integrated graphics engine" is what the doc says.
"please ensure you file a bug" is what the doc says.

This should be re-opened.
Comment 101 Ernest Hurtado 2018-09-11 17:28:57 UTC
Before Linux 4.18 I could reproduce this bug on Broadwell and Skylake hardware. On Linux 4.18 I can't reproduce it on Skylake anymore (after months of testing). I can't test on Broadwell, as my machine didn't live as long as this bug. I don't know which exact commits fixed this issue on Skylake, or why it still fails on Broadwell as other people reported.

BTW: on Linux 4.19rc1-3 the IOMMU is broken again for graphics, but this is an unrelated issue which is being worked on.
Comment 102 miguelramos 2018-11-21 17:13:41 UTC
I can confirm that the problem still affects Intel Broadwell 5500 (gen 8) with kernel 4.19.2, using both gentoo-sources and vanilla-sources on Gentoo Linux: the workaround is needed to start "gdm" without suffering an Intel GPU hang.

dmesg clearly shows that the problem with DMAR hangs the Intel GPU.
Comment 103 miguelramos 2018-11-21 18:14:39 UTC
(In reply to Ernest Hurtado from comment #101)
> Before Linux 4.18 I could reproduce this bug on Broadwell and Skylake
> hardware. On Linux 4.18 I can't reproduce this on Skylake anymore (after
> months of testing). I can't test on Broadwell as my machine didn't lived as
> long as this bug. I don't know which exact commits fixed this issue on
> Skylake and why it doesn't work on Broadwell as other people reported.
> 
> BTW: on Linux 4.19rc1-3 iommu is broken again for graphics but this is
> unrelated issue which is being worked on.

Hi, Ernest.

At least on Gentoo Linux, I did not have the problem with Broadwell in kernel 4.18 or in earlier versions back to 4.14.x.

The machine is a Dell Inspiron 5540 laptop with Intel integrated graphics and a discrete AMD Topaz GPU. To get both providers working with PRIME, I had to have a conf file in /etc/X11/xorg.d/ declaring DRI "3" for the Intel xf86 video driver.

The hang up problem with the Intel GPU has appeared for me since the 4.19.1 linux kernel.

Now I can get "gdm" working with the workaround "intel_iommu=igfx_off" and without any specific xorg driver conf file in /etc/X11/xorg.d/. I can use both graphics cards with PRIME, although the "modesetting" xf86 video driver doesn't work yet and I have to keep the old Intel xf86 video driver.

This behaviour with previous kernels makes me suspect that the iommu breakage in 4.19rc1-3 you mention has something to do with the problem in Broadwell (at least in my laptop). 

Miguel Ángel
Comment 104 Ernest Hurtado 2018-11-22 09:29:08 UTC
(In reply to miguelramos from comment #103)
> This behaviour with previous kernels makes me suspect that the iommu
> breakage in 4.19rc1-3 you mention has something to do with the problem in
> Broadwell (at least in my laptop). 
> 
> Miguel Ángel

In the issue I mentioned the display wasn't working at all at boot (black screen), which is fixed in 4.19 stable release so I think this wasn't related to your problems.
Comment 105 miguelramos 2018-11-23 13:15:12 UTC
Created attachment 142592 [details]
error file after DMAR error hangs GPU trying to start "gdm" with 4.19.3

This is the /sys/class/drm/card0/error file I get after GPU reset fails trying to start gnome-shell gdm
Comment 106 miguelramos 2018-11-23 13:49:09 UTC
Created attachment 142593 [details]
DMAR error in journal trying to start gnome-shell gdm

The attachment here contains the part of the journal showing the DMAR error when the gentoo-linux 4.19.3 boots without the "intel_iommu=igfx_off" option and I try to start gdm.
Comment 107 miguelramos 2018-11-23 16:54:02 UTC
(In reply to Ernest Hurtado from comment #104)
> (In reply to miguelramos from comment #103)
> > This behaviour with previous kernels makes me suspect that the iommu
> > breakage in 4.19rc1-3 you mention has something to do with the problem in
> > Broadwell (at least in my laptop). 
> > 
> > Miguel Ángel
> 
> In the issue I mentioned the display wasn't working at all at boot (black
> screen), which is fixed in 4.19 stable release so I think this wasn't
> related to your problems.

You are right Ernest. 

And because of that, I conjecture that the problem that appears in my case for the first time in kernel 4.19 might be related to the specific changes implemented in the IOMMU support in kernel 4.19.

M. A.
Comment 108 Francesco Balestrieri 2018-12-04 08:09:15 UTC
*** Bug 107921 has been marked as a duplicate of this bug. ***
Comment 109 Francesco Balestrieri 2019-02-06 07:55:05 UTC
miguelramos, do you still see the issue with kernel 4.19.5 or later?
Comment 110 Francesco Balestrieri 2019-03-22 09:03:03 UTC
Ping miguelramos?
Comment 112 Yves-Alexis 2019-03-22 09:48:01 UTC
(In reply to miguelramos from comment #102)
> I can confirm that the problem still affects Intel Broadwell 5500 (gen 8)
> that needs the workaround in kernel 4.19.2 using gentoo-sources and
> vanilla-sources in Gentoo Linux to start "gdm" without suffering a Intel GPU
> hang.
> 
> dmesg clearly shows that the problem with DMAR hangs the Intel GPU.

So I tried removing igfx_off in (Debian) 4.19 kernel (BDW) and it sure still doesn't work. It might be a different bug than before but I still have to use that option.
Comment 113 Francesco Balestrieri 2019-03-22 10:49:41 UTC
Was that with kernel 4.19.5 or later?
Comment 114 Yves-Alexis 2019-03-22 11:20:45 UTC
(In reply to Francesco Balestrieri from comment #113)
> Was that with kernel 4.19.5 or later?

My last test was with 4.19.28
Comment 115 Pacho Ramos 2019-05-03 10:05:05 UTC
(In reply to Francesco Balestrieri from comment #109)
> miguelramos, do you still see the issue with kernel 4.19.5 or later?

Yes, exactly the same with kernel 5.0.10

As soon as gdm is started, the laptop hangs completely. I still need to pass intel_iommu=igfx_off to work around it.
Comment 116 Pacho Ramos 2019-05-03 10:27:22 UTC
But I don't know how to get the updated logs... because the system hangs completely as soon as I try to launch GDM and hit the bug :/
Comment 117 Toroid 2019-07-06 12:37:31 UTC
Hello, 

i have 

 description: Notebook
    product: LIFEBOOK E8420
    vendor: FUJITSU SIEMENS
    version: E84__
    width: 32 bits
    capabilities: smbios-2.4 dmi-2.4

with 

00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07) (prog-if 00 [VGA controller])
	Subsystem: Fujitsu Limited. Mobile 4 Series Chipset Integrated Graphics Controller
	Flags: bus master, fast devsel, latency 0, IRQ 16
	Memory at f2000000 (64-bit, non-prefetchable) [size=4M]
	Memory at d0000000 (64-bit, prefetchable) [size=256M]
	I/O ports at 1800 [size=8]
	[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: i915
	Kernel modules: i915

00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
	Subsystem: Fujitsu Limited. Mobile 4 Series Chipset Integrated Graphics Controller
	Flags: bus master, fast devsel, latency 0
	Memory at f2400000 (64-bit, non-prefetchable) [size=1M]
	Capabilities: <access denied>


running 

4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:12 UTC 2019 i686 i686 i686 GNU/Linux

My system reproducibly turns into a brick every 2nd boot when switching to inteldrmfb from VESA ?VGA? or so. 
Sometimes the CLI screen gets a bit cluttered, or blacks out, or the machine just heats up the room. 

I also noticed: 
KiCAD schematics gets cluttered with mousecross until refresh. 
I can not use a secondary display by hdmi and xrandr. 

I tried to find some logs, but since it happens during bootup I did not find anything that would let me investigate further. 

If I can help to investigate / fix the annoying shit, please let me know what to do and how, as I would really like to use my big stationary screen for CAD.

Sincerely Reiner
Comment 118 Lakshmi 2019-07-09 05:34:06 UTC
(In reply to Toroid from comment #117)

Hi, this looks like a different issue from the original bug report. Can you please verify the issue with drm-tip (https://cgit.freedesktop.org/drm-tip)?
If the problem persists, create a new bug and attach the dmesg log from boot with the kernel parameters drm.debug=0x1e log_buf_len=4M.
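
Before attaching a dmesg log, it may be worth checking that the debug parameters really reached the kernel. This is only a sketch under stated assumptions: the command line is passed in as a string (for example read from /proc/cmdline), values are compared as exact strings (so an equivalent spelling such as a decimal value would be reported as missing), and the function names are hypothetical.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Parameters without '=' map to None. Quoted values containing
    spaces are not handled in this simplified sketch.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline, wanted):
    """Return the names from `wanted` that are absent or set differently."""
    params = parse_cmdline(cmdline)
    return [name for name, value in wanted.items() if params.get(name) != value]


# The two parameters requested in the comment above.
DEBUG_PARAMS = {"drm.debug": "0x1e", "log_buf_len": "4M"}
```

An empty list from missing_params(cmdline, DEBUG_PARAMS) suggests the boot used the requested settings.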
Comment 119 Toroid 2019-07-09 14:48:22 UTC
Thanks, but to me this does not look like anything I can follow up on, check or change, or like information that might be handy for anything. I come from networking and project management and am not a coder at all, except for some shell, C, Fortran .... 
Doing some system configuration is still fine for me, but not reconfiguring a package and maybe locking up my system, which I really need. 


I will just try with the kernel dmesg => log to see if I can grab some logging. 
Blacklisting inteldrmfb doesn't help, by the way; the system still locks up with a black screen.
Comment 121 Martin Peres 2019-11-29 17:12:12 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/21.

