When using Linux 3.19 and 4.0 as the dom0 kernel of Xen 4.5.0, characters on the screen become broken after the graphics driver is loaded. Please see the attached screenshot. After Xorg is started by GDM, more errors occur and my monitor turns off because it receives no signal.

[ 337.673979] [drm] stuck on render ring
[ 337.676815] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
[ 337.676817] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 337.676818] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 337.676818] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 337.676819] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 337.676820] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 337.680940] drm/i915: Resetting chip after gpu hang
[ 343.665948] [drm] stuck on render ring
[ 343.669709] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
[ 343.670016] [drm:i915_set_reset_status [i915]] *ERROR* gpu hanging too fast, banning!
[ 343.673893] drm/i915: Resetting chip after gpu hang
[ 345.086609] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020

Please see the attached dmesg and crash dump. This problem makes the desktop unstable and unusable.

Hardware:
Intel Core i5 CPU 650 @ 3.20GHz
Intel Ironlake Desktop

Software:
Bad version: Xen 4.5.0 and Linux 3.19.2, 3.19.3, 3.19.4, 4.0
Good version: Xen 4.5.0 and Linux 3.18.7
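For anyone comparing their own logs against this report, here is a minimal sketch that pulls the GPU HANG lines out of a dmesg capture and measures the time between them. The regular expression and the function name `gpu_hangs` are my own assumptions based on the two hang lines quoted in this report, not an i915 tool, and the message format may differ on other kernel versions.

```python
import re

# Assumed format, taken from the hang lines in this report:
# [ 337.676815] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], ...
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\] GPU HANG: ecode (?P<ecode>[^,]+),"
)


def gpu_hangs(dmesg_text):
    """Return a list of (timestamp_seconds, ecode) for each GPU HANG line."""
    return [
        (float(m.group("ts")), m.group("ecode"))
        for m in HANG_RE.finditer(dmesg_text)
    ]


# The two hang lines from the log in this report.
sample = """\
[ 337.676815] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
[ 343.669709] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
"""

hangs = gpu_hangs(sample)
# Seconds elapsed between consecutive hangs.
gaps = [later[0] - earlier[0] for earlier, later in zip(hangs, hangs[1:])]
print([f"{gap:.3f}" for gap in gaps])  # ['5.993']
```

The second hang follows the first by about six seconds, which matches the "gpu hanging too fast, banning!" message logged right after it.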
Created attachment 115079 Screenshot when the system is running in single user mode
Created attachment 115080 dmesg
Created attachment 115081 /sys/class/drm/card0/error
git bisect shows the bad commit is https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=47591df
(In reply to Ting-Wei Lan from comment #4)
> git bisect shows the bad commit is
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=47591df

commit 47591df505129c9774af6cca2debf283a6e56ed7
Author: Juergen Gross <jgross@suse.com>
Date: Mon Nov 3 14:02:04 2014 +0100

    xen: Support Xen pv-domains using PAT

Please report this to xen folks. I'll leave this open for tracking purposes for now, although I was tempted to resolve NOTOURBUG.
Was this reported to Xen folks? I don't think i915 developers will attempt to fix this, and it has been over a month, so closing as NOTOURBUG.
It seems this problem is related to Intel VT-d. If I disable VT-d by adding iommu=off to Xen boot options, this error will not happen.
I think I should reopen this bug because the problem also happens without using Xen.

http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg02394.html
http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg02387.html

This problem also happens on Linux >= 3.7 without using Xen when 'intel_iommu=on' is used. It can be worked around by adding 'intel_iommu=igfx_off'. Is this expected behavior or a bug?

Here are some 'dmesg | grep -i iommu' outputs.

Linux 3.6.11 with intel_iommu=on works fine.
[ +0.000000] Intel-IOMMU: enabled
[ +0.005366] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[ +0.005360] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[ +0.005359] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
[ +0.003267] IOMMU 0 0xfed90000: using Register based invalidation
[ +0.006143] IOMMU 2 0xfed93000: using Register based invalidation
[ +0.006141] IOMMU: Setting RMRR:
[ +0.003298] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[ +0.008310] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[ +0.008269] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[ +0.007753] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[ +0.007753] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ +0.005376] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]

Linux >= 3.7 without any intel_iommu argument works fine.
[ +0.005391] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[ +0.005385] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[ +0.005384] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000

Linux >= 3.7 with intel_iommu=on causes graphics problems.
[ +0.000000] Intel-IOMMU: enabled
[ +0.005391] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[ +0.005383] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
[ +0.003430] IOMMU: dmar1 using Register based invalidation
[ +0.005553] IOMMU: dmar0 using Register based invalidation
[ +0.005559] IOMMU: dmar2 using Register based invalidation
[ +0.005560] IOMMU: Setting RMRR:
[ +0.003314] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[ +0.008341] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[ +0.008334] IOMMU: Setting identity map for device 0000:00:02.0 [0xd7c00000 - 0xdfffffff]
[ +0.009797] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[ +0.007795] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[ +0.007798] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ +0.005398] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]

Linux >= 3.7 with intel_iommu=igfx_off works fine.
[ +0.000000] Intel-IOMMU: disable GFX device mapping
[ +0.005388] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[ +0.005385] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[ +0.005383] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000

Linux >= 3.7 with both intel_iommu=on and intel_iommu=igfx_off also works fine.
[ 0.000000] Intel-IOMMU: disable GFX device mapping
[ 0.000000] Intel-IOMMU: enabled
[ 0.205011] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[ 0.218432] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[ 0.231848] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
[ 1.873199] IOMMU: dmar0 using Register based invalidation
[ 1.878757] IOMMU: dmar2 using Register based invalidation
[ 1.884315] IOMMU: Setting RMRR:
[ 1.887631] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[ 1.895972] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[ 1.904285] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[ 1.912079] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[ 1.919871] IOMMU: Prepare 0-16MiB unity mapping for LPC
[ 1.925268] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]

It seems the difference between the working and broken arguments is the identity map for 'device 0000:00:02.0', which is the Intel integrated graphics controller.
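The comparison of identity-mapped devices can be repeated mechanically. The sketch below is only an illustration under my own assumptions: the helper name `identity_mapped_devices` is hypothetical, the regular expression is written against the "Setting identity map for device" lines in this comment, and the two samples are the identity-map lines from the intel_iommu=on log and from the log with both arguments.

```python
import re

# Matches e.g. "IOMMU: Setting identity map for device 0000:00:02.0 [0x... - 0x...]"
IDENTITY_MAP_RE = re.compile(r"Setting identity map for device (\S+) \[")


def identity_mapped_devices(dmesg_text):
    """Return the set of PCI addresses that received an identity map."""
    return set(IDENTITY_MAP_RE.findall(dmesg_text))


# Identity-map lines from the intel_iommu=on log (broken graphics).
broken = """\
[ +0.003314] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[ +0.008341] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[ +0.008334] IOMMU: Setting identity map for device 0000:00:02.0 [0xd7c00000 - 0xdfffffff]
[ +0.009797] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[ +0.007795] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[ +0.005398] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
"""

# Identity-map lines from the intel_iommu=on plus igfx_off log (working).
working = """\
[ 1.887631] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[ 1.895972] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[ 1.904285] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[ 1.912079] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[ 1.925268] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
"""

only_in_broken = identity_mapped_devices(broken) - identity_mapped_devices(working)
print(sorted(only_in_broken))  # ['0000:00:02.0']
```

This only compares identity-map lines. Other differences between the logs, such as which dmar units report register based invalidation, are not covered by it.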
It's odd that it was triggered (in the Xen case) by a PAT patch.

What was the actual effect of that patch on the caching mode used by the machine in question?

> [ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000

cap & (1<<4) is set, which is the RWBF bit:

    1: Indicates software must explicitly flush the write buffers to ensure updates made to memory-resident remapping structures are visible to hardware.

ecap & (1<<0) is clear, which is the Coherency bit:

    This field indicates if hardware access to the root, context, extended-context and interrupt-remap tables, and second-level paging structures for requests-without-PASID, are coherent (snooped) or not.
    • 0: Indicates hardware accesses to remapping structures are non-coherent.

So basically this hardware is in a mode where the IOMMU page tables are non-cache coherent. Not only do you have to clflush every cache line in the page tables to main memory when you write it, but you *also* have to jump through hoops to ensure that the writes are pushed through chipset-specific write buffers (see §6.8 of the VT-d specification).

That may help to explain why a seemingly innocent PAT change might have triggered something odd. But it would be good to know precisely what went wrong.

Also, does it help to add 'iommu=pt' to the kernel command line? That would make the IOMMU use a 1:1 mapping of all memory, rather than dynamically setting up mappings.

You say it can be reproduced without Xen, with Linux >= 3.7. Can you show the details of that please? And if it doesn't occur in 3.6, can you also bisect the non-Xen case to find when it started happening, please?

Thanks,
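The two bit tests above can be checked against all three DMAR units from the earlier dmesg output. This is a minimal sketch: the bit positions (cap bit 4 for RWBF, ecap bit 0 for Coherency) are the ones named in this comment, while the function name `decode_dmar` and the dictionary layout are my own and purely illustrative.

```python
RWBF_BIT = 1 << 4       # cap bit 4: software must flush write buffers
COHERENCY_BIT = 1 << 0  # ecap bit 0: hardware page-table accesses are snooped


def decode_dmar(cap_hex, ecap_hex):
    """Decode the two capability bits discussed above from dmesg hex strings."""
    cap = int(cap_hex, 16)
    ecap = int(ecap_hex, 16)
    return {
        "rwbf": bool(cap & RWBF_BIT),
        "coherent": bool(ecap & COHERENCY_BIT),
    }


# (cap, ecap) values as printed in the "dmar: IOMMU n" lines of this report.
units = {
    "IOMMU 0": ("c9008020e30272", "1000"),
    "IOMMU 1": ("c0000020230272", "1000"),
    "IOMMU 2": ("c9008020630272", "1000"),
}

for name, (cap_hex, ecap_hex) in units.items():
    bits = decode_dmar(cap_hex, ecap_hex)
    print(f"{name}: RWBF={bits['rwbf']} coherent={bits['coherent']}")
# IOMMU 0: RWBF=True coherent=False
# IOMMU 1: RWBF=True coherent=False
# IOMMU 2: RWBF=True coherent=False
```

All three units report the same two bits, so the non-coherent page-table behavior is not specific to IOMMU 1.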
(In reply to David Woodhouse from comment #9)
> It's odd that it was triggered (in the Xen case) by a PAT patch.
>
> What was the actual effect of that patch on the caching mode used by the
> machine in question?
>
> > [ +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap
> c0000020230272 ecap 1000
>
> cap & (1<<4) is set, which is the RWBF bit:
>
> 1: Indicates software must explicitly flush
> the write buffers to ensure updates made to
> memory-resident remapping structures are
> visible to hardware.
>
> ecap & (1<<0) is clear, which is the Coherency bit:
>
> This field indicates if hardware access to the
> root, context, extended-context and
> interrupt-remap tables, and second-level
> paging structures for requests-without-
> PASID, are coherent (snooped) or not.
> • 0: Indicates hardware accesses to
> remapping structures are non-coherent.
>
> So basically this hardware is in a mode where the IOMMU page tables are
> non-cache coherent. Not only do you have to clflush every cache line in the
> page tables to main memory when you write it, but you *also* have to jump
> through hoops to ensure that the writes are pushed through chipset-specific
> write buffers (see §6.8 of the VT-d specification).
>
> That may help to explain why a seemingly innocent PAT change might have
> triggered something odd. But it would be good to know precisely what went
> wrong.

Can you tell me how I can test it, or give me a link that describes the steps to get the needed information? I am not familiar with the VT-d spec. There was some discussion on xen-devel when I tried to make a workaround.

http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03642.html
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03723.html

>
> Also, does it help to add 'iommu=pt' to the kernel command line? That would
> make the IOMMU use a 1:1 mapping of all memory, rather than dynamically
> setting up mappings.

No, screen output is still broken.

>
> You say it can be reproduced without Xen, with Linux >= 3.7. Can you show
> the details of that please? And if it doesn't occur in 3.6, can you also
> bisect the non-Xen case to find when it started happening, please?

The non-Xen case is already reported here:
https://bugs.freedesktop.org/show_bug.cgi?id=91127

Bisect result:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=edef7e6

The non-Xen case is partially fixed now. Screen output works fine, but the system crashes after being used for several hours.

>
> Thanks,
*** Bug 91400 has been marked as a duplicate of this bug. ***
Good afternoon,

Sorry for the long delay. The last kernel reported in this case was 4.0, which is quite old, and lots of changes have been made since then, so I'm closing this bug as invalid. If the problem persists on the newest kernel versions (https://www.kernel.org/), please open a new bug with HW and SW information, logs and steps to reproduce. Thank you.
I can reproduce the problem on the same hardware running Xen 4.8.2 and Linux 4.13.2 unless iommu=no-igfx is passed on the Xen hypervisor command line.
Hello again,

Could you please attach a new dmesg log and error state from a newer kernel version, booted with the parameters drm.debug=0x1e log_buf_len=2M (or bigger) set in GRUB? Thank you.
I'm probably wrong, but this issue may be related to bug 89360.
Created attachment 136084 dmesg (Xen 4.8.2 + Linux 4.14.4)

It took me more than 1 hour to get this file ... It crashed too quickly.

Xen dmesg messages were obtained from serial console and 'xl dmesg' command. Linux dmesg messages earlier than timestamp 520.360867 were obtained from 'dmesg' command. All messages after it were obtained from serial console because the system crashed and the ssh connection was broken.

I disabled wayland in /etc/gdm/custom.conf in order to get the result. The system also crashed in wayland mode, but there was no crash dump file or drm message.

Steps of operations:
1. In GRUB menu, remove 'iommu=no-igfx' from Xen command line and add 'drm.debug=0x1e log_buf_len=64M s' to Linux command line.
2. Boot the system and wait 5 minutes to get single user shell.
3. Delete /var/run/nologin.
4. Mount /proc/xen.
5. Start NetworkManager and sshd.
6. Connect to the host from ssh and run 'xl dmesg' and 'dmesg -w' commands.
7. Leave single user shell to continue normal boot.
8. Once the screen output becomes more broken, type 'sudo cat /sys/class/drm/card0/error > gpu_crash_dump; sudo sync' command as soon as possible because the system will stop responding within a few seconds.
9. Reboot the system with Xen console command 'R'.
10. Boot the system normally to download 'gpu_crash_dump' file.
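Step 8 is time critical, so it may help to have the copy prepared in advance. The sketch below is one possible way to do it in Python, not what was actually run for this report: the function name `save_error_state` is hypothetical, the default paths are the ones from step 8, and it calls `os.fsync` on the output file, which is narrower than the system-wide `sync` used in the steps.

```python
import os
import shutil


def save_error_state(src="/sys/class/drm/card0/error", dst="gpu_crash_dump"):
    """Copy the i915 error state to dst and force it to disk before returning.

    Returns the number of bytes written to dst.
    """
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
        fout.flush()             # push Python's buffer to the kernel
        os.fsync(fout.fileno())  # ask the kernel to write the file to disk
    return os.path.getsize(dst)
```

Calling `save_error_state()` as root from the ssh session would replace the 'cat ...; sync' command. Whether it finishes before the system stops responding is not something this sketch can guarantee.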
Created attachment 136085 /sys/class/drm/card0/error
(In reply to Elizabeth from comment #15)
> I'm probably wrong, but this issue may be related to bug 89360.

Yep, wrong. Judging by the previous comments, the situation seems unchanged and still points to NOTOURBUG, although VT-d is involved. Let me ping some people to verify.
First of all, sorry about the spam. This is a mass update for our bugs. Sorry if you find this annoying, but we are trying to understand whether this bug is still valid. If the bug investigation is still in progress, please ignore this, and I apologize! If you think this bug is no longer valid, please comment that it can be closed. If you haven't tested with our latest pre-upstream tree (drm-tip), can you also do that to see whether the issue is still valid there? If you cannot see the issue there, please comment on the bug.
I just downloaded and tested drm-tip commit c46052cde6a5, and I can still reproduce the problem on this machine.
OK, thanks for the feedback. Chris, any help from you on this?
Ting, sorry for the delay. Do you still have the issue? If so, try to reproduce the issue using drm-tip (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e log_buf_len=4M, and if the problem persists attach the full dmesg from boot. This will speed up the investigation.
(In reply to Lakshmi from comment #22)
> Ting, sorry for the delay.
>
> Do you still have the issue?
> If so, try to reproduce the issue using drm-tip
> (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e
> log_buf_len=4M, and if the problem persists attach the full dmesg from boot.
>
> This will speed up the investigation.

Yes, the problem still exists. I could reproduce it with drm-tip commit 6dc8457a2f2093eecb9c6cbb7306fd25bb1664e6. I tested twice and the results were similar: all characters were broken and the system was unable to show the GDM login screen. The system was accessible over SSH but it couldn't reboot. I ended up pressing 'R' on the Xen hypervisor console to reboot it.
Created attachment 141535 dmesg (Xen 4.10.1 + Linux 4.19.0-rc2+) #1

This is the log from the first test. I am not sure why there is an ext4 error in the log, but the kernel starts printing call traces after showing the error. I forgot to ask Xen to load the Intel CPU microcode update in this test, but I think it should not affect the test result. There is a gap between 11.948721 and 315.808150 in the log because it took 10 minutes to activate LVM.
Created attachment 141536 dmesg (Xen 4.10.1 + Linux 4.19.0-rc2+) #2

This is the log from the second test. After the first test, I rebooted the system with 'iommu=no-igfx' set on the Xen command line and hoped it could boot normally. However, it stopped and dropped into a shell in the initramfs because the fsck on the rootfs failed. I manually performed fsck and the system seemed to boot up normally to the desktop. I assumed all filesystem troubles caused by the previous test were now cleaned up, and I rebooted the system to do the second test.

This time I remembered to add 'ucode=-1' to the Xen command line to let it load the Intel CPU microcode update. The version of the microcode update file is 'revision 0x11, date = 2018-05-08'.

The kernel printed a lot of repeated messages in this test and the log quickly grew over 30M. I reset the system from Xen once I saw it printing messages endlessly. Because of the large file size, I only uploaded the first 20000 lines of the log here.
(In reply to Ting-Wei Lan from comment #23)
> (In reply to Lakshmi from comment #22)
> > Ting, sorry for the delay.
> >
> > Do you still have the issue?
> > If so, try to reproduce the issue using drm-tip
> > (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e
> > log_buf_len=4M, and if the problem persists attach the full dmesg from boot.
> >
> > This will speed up the investigation.
>
> Yes, the problem still exists. I could reproduce it with drm-tip commit
> 6dc8457a2f2093eecb9c6cbb7306fd25bb1664e6. I tested two times and the results
> were similar: all characters are broken and the system was unable to show
> GDM login screen. The system was accessible from SSH but it couldn't reboot.
> I ended up pressing 'R' on the Xen hypervisor console to reboot it.

How often you see this issue? Every time you reboot?
(In reply to Lakshmi from comment #26)
> (In reply to Ting-Wei Lan from comment #23)
> > (In reply to Lakshmi from comment #22)
> > > Ting, sorry for the delay.
> > >
> > > Do you still have the issue?
> > > If so, try to reproduce the issue using drm-tip
> > > (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e
> > > log_buf_len=4M, and if the problem persists attach the full dmesg from boot.
> > >
> > > This will speed up the investigation.
> >
> > Yes, the problem still exists. I could reproduce it with drm-tip commit
> > 6dc8457a2f2093eecb9c6cbb7306fd25bb1664e6. I tested two times and the results
> > were similar: all characters are broken and the system was unable to show
> > GDM login screen. The system was accessible from SSH but it couldn't reboot.
> > I ended up pressing 'R' on the Xen hypervisor console to reboot it.
>
> How often you see this issue? Every time you reboot?

Yes, it happens on every boot unless I pass iommu=no-igfx to Xen command line.
The problem still exists in 4.19.10. I also have an Ironlake iGPU:

$ grep 'model name' /proc/cpuinfo | head -n1
model name : Intel(R) Core(TM) i5 CPU 660 @ 3.33GHz
@Chris, do you have any other suggestion for this issue apart from booting with intel_iommu=off?
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/22.