Bug 90037 - [xen iommu] After upgrading to Linux 3.19, desktop no longer works in Xen 4.5.0 dom0
Status: REOPENED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Duplicates: 91400
Depends on:
Blocks: 91400
Reported: 2015-04-15 09:26 UTC by Ting-Wei Lan
Modified: 2018-12-05 16:02 UTC

See Also:
i915 platform: ILK
i915 features: GPU hang


Attachments
Screenshot when the system is running in single user mode (2.69 MB, image/jpeg)
2015-04-15 09:29 UTC, Ting-Wei Lan
dmesg (101.58 KB, text/plain)
2015-04-15 09:30 UTC, Ting-Wei Lan
/sys/class/drm/card0/error (1.34 MB, text/plain)
2015-04-15 09:30 UTC, Ting-Wei Lan
dmesg (Xen 4.8.2 + Linux 4.14.4) (237.83 KB, text/plain)
2017-12-11 16:28 UTC, Ting-Wei Lan
/sys/class/drm/card0/error (91.40 KB, text/plain)
2017-12-11 16:29 UTC, Ting-Wei Lan
dmesg (Xen 4.10.1 + Linux 4.19.0-rc2+) #1 (205.15 KB, text/plain)
2018-09-12 15:28 UTC, Ting-Wei Lan
dmesg (Xen 4.10.1 + Linux 4.19.0-rc2+) #2 (2.38 MB, text/plain)
2018-09-12 15:46 UTC, Ting-Wei Lan

Description Ting-Wei Lan 2015-04-15 09:26:04 UTC
When using Linux 3.19 or 4.0 as the dom0 kernel of Xen 4.5.0, characters on the screen become broken after the graphics driver is loaded. Please see the attached screenshot.

After Xorg is started by GDM, more errors appear and my monitor turns off because it receives no signal.

[  337.673979] [drm] stuck on render ring
[  337.676815] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
[  337.676817] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  337.676818] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  337.676818] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  337.676819] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  337.676820] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  337.680940] drm/i915: Resetting chip after gpu hang
[  343.665948] [drm] stuck on render ring
[  343.669709] [drm] GPU HANG: ecode 5:0:0xfdffffff, in Xorg.bin [2221], reason: Ring hung, action: reset
[  343.670016] [drm:i915_set_reset_status [i915]] *ERROR* gpu hanging too fast, banning!
[  343.673893] drm/i915: Resetting chip after gpu hang
[  345.086609] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020

Please see the attached dmesg and crash dump. This problem makes the desktop unstable and unusable.


Hardware:
Intel Core i5 CPU 650 @ 3.20GHz
Intel Ironlake Desktop

Software:
Bad version:  Xen 4.5.0 and Linux 3.19.2, 3.19.3, 3.19.4, 4.0
Good version: Xen 4.5.0 and Linux 3.18.7
Comment 1 Ting-Wei Lan 2015-04-15 09:29:02 UTC
Created attachment 115079 [details]
Screenshot when the system is running in single user mode
Comment 2 Ting-Wei Lan 2015-04-15 09:30:03 UTC
Created attachment 115080 [details]
dmesg
Comment 3 Ting-Wei Lan 2015-04-15 09:30:30 UTC
Created attachment 115081 [details]
/sys/class/drm/card0/error
Comment 4 Ting-Wei Lan 2015-04-27 04:08:06 UTC
git bisect shows the bad commit is https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=47591df
Comment 5 Jani Nikula 2015-04-27 09:22:39 UTC
(In reply to Ting-Wei Lan from comment #4)
> git bisect shows the bad commit is
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=47591df

commit 47591df505129c9774af6cca2debf283a6e56ed7
Author: Juergen Gross <jgross@suse.com>
Date:   Mon Nov 3 14:02:04 2014 +0100

    xen: Support Xen pv-domains using PAT

Please report this to xen folks. I'll leave this open for tracking purposes for now, although I was tempted to resolve NOTOURBUG.
Comment 6 Ander Conselvan de Oliveira 2015-06-02 07:02:09 UTC
Was this reported to Xen folks? I don't think i915 developers will attempt to fix this, and it has been over a month, so closing as NOTOURBUG.
Comment 7 Ting-Wei Lan 2015-06-11 18:15:58 UTC
It seems this problem is related to Intel VT-d. If I disable VT-d by adding iommu=off to the Xen boot options, the error does not happen.
Comment 8 Ting-Wei Lan 2015-06-16 09:46:06 UTC
I think I should reopen this bug because the problem also happens without using Xen.

http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg02394.html
http://lists.xenproject.org/archives/html/xen-devel/2015-06/msg02387.html


This problem also happens on Linux >= 3.7 without Xen when 'intel_iommu=on' is used. It can be worked around by adding 'intel_iommu=igfx_off'. Is this expected behavior or a bug? Here is the output of 'dmesg | grep -i iommu' for several configurations.



Linux 3.6.11 with intel_iommu=on works fine.
[  +0.000000] Intel-IOMMU: enabled
[  +0.005366] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[  +0.005360] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[  +0.005359] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
[  +0.003267] IOMMU 0 0xfed90000: using Register based invalidation
[  +0.006143] IOMMU 2 0xfed93000: using Register based invalidation
[  +0.006141] IOMMU: Setting RMRR:
[  +0.003298] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[  +0.008310] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[  +0.008269] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[  +0.007753] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[  +0.007753] IOMMU: Prepare 0-16MiB unity mapping for LPC
[  +0.005376] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]


Linux >= 3.7 without any intel_iommu argument works fine.
[  +0.005391] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[  +0.005385] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[  +0.005384] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000


Linux >= 3.7 with intel_iommu=on causes graphics problems.
[  +0.000000] Intel-IOMMU: enabled
[  +0.005391] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[  +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[  +0.005383] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
[  +0.003430] IOMMU: dmar1 using Register based invalidation
[  +0.005553] IOMMU: dmar0 using Register based invalidation
[  +0.005559] IOMMU: dmar2 using Register based invalidation
[  +0.005560] IOMMU: Setting RMRR:
[  +0.003314] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[  +0.008341] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[  +0.008334] IOMMU: Setting identity map for device 0000:00:02.0 [0xd7c00000 - 0xdfffffff]
[  +0.009797] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[  +0.007795] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[  +0.007798] IOMMU: Prepare 0-16MiB unity mapping for LPC
[  +0.005398] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]


Linux >= 3.7 with intel_iommu=igfx_off works fine.
[  +0.000000] Intel-IOMMU: disable GFX device mapping
[  +0.005388] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[  +0.005385] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[  +0.005383] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000


Linux >= 3.7 with both intel_iommu=on and intel_iommu=igfx_off also works fine.
[    0.000000] Intel-IOMMU: disable GFX device mapping
[    0.000000] Intel-IOMMU: enabled
[    0.205011] dmar: IOMMU 0: reg_base_addr fed90000 ver 1:0 cap c9008020e30272 ecap 1000
[    0.218432] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
[    0.231848] dmar: IOMMU 2: reg_base_addr fed93000 ver 1:0 cap c9008020630272 ecap 1000
[    1.873199] IOMMU: dmar0 using Register based invalidation
[    1.878757] IOMMU: dmar2 using Register based invalidation
[    1.884315] IOMMU: Setting RMRR:
[    1.887631] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[    1.895972] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[    1.904285] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[    1.912079] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[    1.919871] IOMMU: Prepare 0-16MiB unity mapping for LPC
[    1.925268] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]



It seems the difference between the working and broken configurations is the identity map for 'device 0000:00:02.0', which is the Intel integrated graphics controller.
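
The comparison above can also be done mechanically. The following is a minimal sketch, not part of the original report: it assumes the 'dmesg | grep -i iommu' output has been saved as text, and the names IDENTITY_MAP, identity_mapped_devices, BROKEN and WORKING are made up for illustration. The two sample strings are the identity-map lines from the "intel_iommu=on" log and from the "intel_iommu=on intel_iommu=igfx_off" log above.

```python
import re

# Matches lines such as:
# [  +0.008334] IOMMU: Setting identity map for device 0000:00:02.0 [0xd7c00000 - 0xdfffffff]
# \s also matches newlines, so a log whose lines were wrapped by a mail client still parses.
IDENTITY_MAP = re.compile(
    r"Setting identity map for device (\S+)\s+"
    r"\[(0x[0-9a-fA-F]+)\s*-\s*(0x[0-9a-fA-F]+)\]"
)


def identity_mapped_devices(dmesg_text):
    """Return the set of PCI addresses that received an identity map."""
    return {m.group(1) for m in IDENTITY_MAP.finditer(dmesg_text)}


# Linux >= 3.7 with intel_iommu=on (graphics broken).
BROKEN = """\
[  +0.003314] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[  +0.008341] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[  +0.008334] IOMMU: Setting identity map for device 0000:00:02.0 [0xd7c00000 - 0xdfffffff]
[  +0.009797] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[  +0.007795] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[  +0.005398] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
"""

# Linux >= 3.7 with intel_iommu=on and intel_iommu=igfx_off (works fine).
WORKING = """\
[    1.887631] IOMMU: Setting identity map for device 0000:00:1a.0 [0xd7aec000 - 0xd7afffff]
[    1.895972] IOMMU: Setting identity map for device 0000:00:1d.0 [0xd7aec000 - 0xd7afffff]
[    1.904285] IOMMU: Setting identity map for device 0000:00:1a.0 [0xe4000 - 0xe7fff]
[    1.912079] IOMMU: Setting identity map for device 0000:00:1d.0 [0xe4000 - 0xe7fff]
[    1.925268] IOMMU: Setting identity map for device 0000:00:1f.0 [0x0 - 0xffffff]
"""

only_in_broken = identity_mapped_devices(BROKEN) - identity_mapped_devices(WORKING)
print(sorted(only_in_broken))  # ['0000:00:02.0']
```

A set difference is enough here because only the presence of a device matters, not how many regions were mapped for it.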
Comment 9 David Woodhouse 2015-08-18 14:52:13 UTC
It's odd that it was triggered (in the Xen case) by a PAT patch.

What was the actual effect of that patch on the caching mode used by the machine in question?

> [  +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000

cap & (1<<4) is set, which is the RWBF bit:

    1: Indicates software must explicitly flush
    the write buffers to ensure updates made to
    memory-resident remapping structures are
    visible to hardware.

ecap & (1<<0) is clear, which is the Coherency bit:

    This field indicates if hardware access to the
    root, context, extended-context and
    interrupt-remap tables, and second-level
    paging structures for requests-without-
    PASID, are coherent (snooped) or not.
    • 0:Indicates hardware accesses to
    remapping structures are non-coherent.

So basically this hardware is in a mode where the IOMMU page tables are non-cache coherent. Not only do you have to clflush every cache line in the page tables to main memory when you write it, but you *also* have to jump through hoops to ensure that the writes are pushed through chipset-specific write buffers (see §6.8 of the VT-d specification).
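
As an illustration of the two bit tests above, here is a minimal sketch that applies them to the cap/ecap values printed for all three IOMMUs in comment 8. The bit positions are the ones quoted in this comment; the names CAP_RWBF, ECAP_C, decode and UNITS are made up for this example.

```python
# Bit positions as described above.
CAP_RWBF = 1 << 4  # cap: software must explicitly flush write buffers
ECAP_C = 1 << 0    # ecap: hardware accesses to remapping structures are coherent


def decode(cap_hex, ecap_hex):
    """Decode the RWBF and Coherency bits from hex strings as printed by dmesg."""
    cap = int(cap_hex, 16)
    ecap = int(ecap_hex, 16)
    return {
        "rwbf": bool(cap & CAP_RWBF),
        "coherent": bool(ecap & ECAP_C),
    }


# (cap, ecap) pairs copied from the dmesg output in comment 8.
UNITS = {
    "IOMMU 0": ("c9008020e30272", "1000"),
    "IOMMU 1": ("c0000020230272", "1000"),
    "IOMMU 2": ("c9008020630272", "1000"),
}

for name, (cap, ecap) in UNITS.items():
    print(name, decode(cap, ecap))
```

On these values, all three units report RWBF set and Coherency clear, so the observation is not specific to IOMMU 1.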

That may help to explain why a seemingly innocent PAT change might have triggered something odd. But it would be good to know precisely what went wrong.

Also, does it help to add 'iommu=pt' to the kernel command line? That would make the IOMMU use a 1:1 mapping of all memory, rather than dynamically setting up mappings.

You say it can be reproduced without Xen, with Linux >= 3.7 — can you show the details of that please? And if it doesn't occur in 3.6, can you also bisect the non-Xen case to find when it started happening, please?

Thanks,
Comment 10 Ting-Wei Lan 2015-08-18 17:13:00 UTC
(In reply to David Woodhouse from comment #9)
> It's odd that it was triggered (in the Xen case) by a PAT patch.
> 
> What was the actual effect of that patch on the caching mode used by the
> machine in question?
> 
> > [  +0.005382] dmar: IOMMU 1: reg_base_addr fed91000 ver 1:0 cap c0000020230272 ecap 1000
> 
> cap & (1<<4) is set, which is the RWBF bit:
> 
>     1: Indicates software must explicitly flush
>     the write buffers to ensure updates made to
>     memory-resident remapping structures are
>     visible to hardware.
> 
> ecap & (1<<0) is clear, which is the Coherency bit:
> 
>     This field indicates if hardware access to the
>     root, context, extended-context and
>     interrupt-remap tables, and second-level
>     paging structures for requests-without-
>     PASID, are coherent (snooped) or not.
>     • 0:Indicates hardware accesses to
>     remapping structures are non-coherent.
> 
> So basically this hardware is in a mode where the IOMMU page tables are
> non-cache coherent. Not only do you have to clflush every cache line in the
> page tables to main memory when you write it, but you *also* have to jump
> through hoops to ensure that the writes are pushed through chipset-specific
> write buffers (see §6.8 of the VT-d specification).
> 
> That may help to explain why a seemingly innocent PAT change might have
> triggered something odd. But it would be good to know precisely what went
> wrong.

Can you tell me how I can test it, or provide a link that describes the steps to get the needed information? I am not familiar with the VT-d spec.

There was some discussion on xen-devel when I tried to make a workaround.
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03642.html
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03723.html

> 
> Also, does it help to add 'iommu=pt' to the kernel command line? That would
> make the IOMMU use a 1:1 mapping of all memory, rather than dynamically
> setting up mappings.

No, screen output is still broken.

> 
> You say it can be reproduced without Xen, with Linux >= 3.7 — can you show
> the details of that please? And if it doesn't occur in 3.6, can you also
> bisect the non-Xen case to find when it started happening, please?

Non-Xen case is already reported here:
https://bugs.freedesktop.org/show_bug.cgi?id=91127

Bisect result:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=edef7e6

The non-Xen case is partially fixed now. Screen output works fine, but the system crashes after being used for several hours.

> 
> Thanks,
Comment 11 Ting-Wei Lan 2016-04-28 18:04:53 UTC
*** Bug 91400 has been marked as a duplicate of this bug. ***
Comment 12 Elizabeth 2017-07-31 14:21:27 UTC
Good afternoon,
Sorry for the long delay. The last kernel reported in this case was 4.0, which is quite old, and many changes have been made since then, so I'm closing this bug as invalid. If the problem persists on the newest kernel versions (https://www.kernel.org/), please open a new bug with HW and SW information, logs, and steps to reproduce. Thank you.
Comment 13 Ting-Wei Lan 2017-09-17 13:25:14 UTC
I can reproduce the problem on the same hardware running Xen 4.8.2 and Linux 4.13.2 unless iommu=no-igfx is passed on the Xen hypervisor command line.
Comment 14 Elizabeth 2017-12-08 23:20:39 UTC
Hello again, 
Could you please attach a new dmesg log and error state from a newer kernel version, booted with the parameters drm.debug=0x1e log_buf_len=2M (or bigger) set in GRUB?
Thank you.
Comment 15 Elizabeth 2017-12-08 23:22:31 UTC
I'm probably wrong, but this issue may be related to bug 89360.
Comment 16 Ting-Wei Lan 2017-12-11 16:28:43 UTC
Created attachment 136084 [details]
dmesg (Xen 4.8.2 + Linux 4.14.4)

It took me more than 1 hour to get this file ... It crashed too quickly.

Xen dmesg messages were obtained from the serial console and the 'xl dmesg' command. Linux dmesg messages earlier than timestamp 520.360867 were obtained from the 'dmesg' command. All messages after that were obtained from the serial console because the system crashed and the ssh connection was broken.

I disabled wayland in /etc/gdm/custom.conf in order to get the result. The system also crashed in wayland mode, but there was no crash dump file or drm message.

Steps of operations:
1. In the GRUB menu, remove 'iommu=no-igfx' from the Xen command line and add 'drm.debug=0x1e log_buf_len=64M s' to the Linux command line.
2. Boot the system and wait 5 minutes to get a single-user shell.
3. Delete /var/run/nologin.
4. Mount /proc/xen.
5. Start NetworkManager and sshd.
6. Connect to the host over ssh and run the 'xl dmesg' and 'dmesg -w' commands.
7. Leave the single-user shell to continue the normal boot.
8. Once the screen output becomes more broken, type the command 'sudo cat /sys/class/drm/card0/error > gpu_crash_dump; sudo sync' as soon as possible, because the system will stop responding within a few seconds.
9. Reboot the system with the Xen console command 'R'.
10. Boot the system normally to download the 'gpu_crash_dump' file.
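
Because step 8 has to happen within a few seconds, it may be easier to prepare it as a small script in advance. The following is only a sketch of the same idea as the 'cat ...; sync' command, not something that was used for the attached files. The function name save_crash_dump is made up, and it flushes only the copied file rather than calling a global sync.

```python
import os


def save_crash_dump(src, dst):
    """Copy src to dst and force the copy to disk before returning.

    Returns the number of bytes copied.
    """
    with open(src, "rb") as f:
        data = f.read()
    with open(dst, "wb") as out:
        out.write(data)
        out.flush()
        # Ask the kernel to write the file data to disk now, since the
        # machine is expected to stop responding shortly afterwards.
        os.fsync(out.fileno())
    return len(data)


# Intended use on the affected machine (run as root), with the paths from step 8:
#     save_crash_dump("/sys/class/drm/card0/error", "gpu_crash_dump")
```

The whole file is read into memory first, which should be acceptable for the dump sizes attached to this bug (about 1.3 MB at most).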
Comment 17 Ting-Wei Lan 2017-12-11 16:29:40 UTC
Created attachment 136085 [details]
/sys/class/drm/card0/error
Comment 18 Elizabeth 2017-12-11 17:52:55 UTC
(In reply to Elizabeth from comment #15)
> I'm probably wrong, but this issue may be related to bug 89360.
Yep, wrong. Judging by the previous comments, the situation seems to be the same and points to NOTOURBUG, though VT-d is involved. Let me ping some people to verify.
Comment 19 Jani Saarinen 2018-03-29 07:10:09 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you do that as well to see whether the issue is still present there? If you cannot see the issue there, please comment on the bug.
Comment 20 Ting-Wei Lan 2018-04-01 15:11:15 UTC
I just downloaded and tested drm-tip commit c46052cde6a5, and I can still reproduce the problem on this machine.
Comment 21 Jani Saarinen 2018-04-22 15:32:12 UTC
OK, thanks for the feedback. Chris, any help from you on this?
Comment 22 Lakshmi 2018-09-08 22:32:46 UTC
Ting, sorry for the delay.

Do you still have the issue?
If so, try to reproduce the issue using drm-tip (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e log_buf_len=4M, and if the problem persists attach the full dmesg from boot.

This will speed up the investigation.
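
For readers wondering what drm.debug=0x1e selects: the value is a bitmask of DRM debug categories. The sketch below decodes it, assuming the category bits defined in the kernel's DRM print header (0x01 core, 0x02 driver, 0x04 kms, 0x08 prime, 0x10 atomic, 0x20 vbl); the names DRM_DEBUG_BITS and decode_drm_debug are made up for this example.

```python
# drm.debug category bits (assumed from the kernel's DRM print header).
DRM_DEBUG_BITS = {
    0x01: "core",
    0x02: "driver",
    0x04: "kms",
    0x08: "prime",
    0x10: "atomic",
    0x20: "vbl",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


print(decode_drm_debug(0x1e))  # ['driver', 'kms', 'prime', 'atomic']
```

Under that assumption, 0x1e enables the driver, kms, prime and atomic categories but leaves out the core messages.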
Comment 23 Ting-Wei Lan 2018-09-12 15:17:58 UTC
(In reply to Lakshmi from comment #22)
> Ting, sorry for the delay.
> 
> Do you still have the issue?
> If so, try to reproduce the issue using drm-tip
> (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e
> log_buf_len=4M, and if the problem persists attach the full dmesg from boot.
> 
> This will speed up the investigation.

Yes, the problem still exists. I could reproduce it with drm-tip commit 6dc8457a2f2093eecb9c6cbb7306fd25bb1664e6. I tested two times and the results were similar: all characters are broken and the system was unable to show GDM login screen. The system was accessible from SSH but it couldn't reboot. I ended up pressing 'R' on the Xen hypervisor console to reboot it.
Comment 24 Ting-Wei Lan 2018-09-12 15:28:34 UTC
Created attachment 141535 [details]
dmesg (Xen 4.10.1 + Linux 4.19.0-rc2+) #1

This is the log from the first test. I am not sure why there is an ext4 error in the log, but the kernel starts printing call traces after showing the error.

I forgot to ask Xen to load the Intel CPU microcode update in this test, but I think it should not affect the test result. There is a gap between 11.948721 and 315.808150 in the log because it took 10 minutes to activate LVM.
Comment 25 Ting-Wei Lan 2018-09-12 15:46:13 UTC
Created attachment 141536 [details]
dmesg (Xen 4.10.1 + Linux 4.19.0-rc2+) #2

This is the log from the second test. After the first test, I rebooted the system with 'iommu=no-igfx' set on the Xen command line and hoped it would boot normally. However, it stopped and dropped into a shell in the initramfs because the fsck on the rootfs failed. I manually performed fsck and the system seemed to boot up normally to the desktop. I assumed all filesystem troubles caused by the previous test were now cleaned up, and I rebooted the system to do the second test.

This time I remembered to add 'ucode=-1' to the Xen command line to let it load the Intel CPU microcode update. The version of the microcode update file is 'revision 0x11, date = 2018-05-08'. The kernel printed a lot of repeated messages in this test and the log quickly grew over 30M. I reset the system from Xen once I saw it printing messages endlessly. Because of the large file size, I only uploaded the first 20000 lines of the log here.
Comment 26 Lakshmi 2018-12-05 11:11:21 UTC
(In reply to Ting-Wei Lan from comment #23)
> (In reply to Lakshmi from comment #22)
> > Ting, sorry for the delay.
> > 
> > Do you still have the issue?
> > If so, try to reproduce the issue using drm-tip
> > (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e
> > log_buf_len=4M, and if the problem persists attach the full dmesg from boot.
> > 
> > This will speed up the investigation.
> 
> Yes, the problem still exists. I could reproduce it with drm-tip commit
> 6dc8457a2f2093eecb9c6cbb7306fd25bb1664e6. I tested two times and the results
> were similar: all characters are broken and the system was unable to show
> GDM login screen. The system was accessible from SSH but it couldn't reboot.
> I ended up pressing 'R' on the Xen hypervisor console to reboot it.

How often do you see this issue? Every time you reboot?
Comment 27 Ting-Wei Lan 2018-12-05 16:02:23 UTC
(In reply to Lakshmi from comment #26)
> (In reply to Ting-Wei Lan from comment #23)
> > (In reply to Lakshmi from comment #22)
> > > Ting, sorry for the delay.
> > > 
> > > Do you still have the issue?
> > > If so, try to reproduce the issue using drm-tip
> > > (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e
> > > log_buf_len=4M, and if the problem persists attach the full dmesg from boot.
> > > 
> > > This will speed up the investigation.
> > 
> > Yes, the problem still exists. I could reproduce it with drm-tip commit
> > 6dc8457a2f2093eecb9c6cbb7306fd25bb1664e6. I tested two times and the results
> > were similar: all characters are broken and the system was unable to show
> > GDM login screen. The system was accessible from SSH but it couldn't reboot.
> > I ended up pressing 'R' on the Xen hypervisor console to reboot it.
> 
> How often do you see this issue? Every time you reboot?

Yes, it happens on every boot unless I pass iommu=no-igfx on the Xen command line.

