Summary: | Failed to initialize GPU and Screen corruption at the top of the screen. | ||
---|---|---|---|
Product: | DRI | Reporter: | Felix von Leitner <felix-freedesktop> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | RESOLVED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | high | CC: | chris, galkin-vv, intel-gfx-bugs, lakshminarayana.vudum |
Version: | DRI git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | Triaged, ReadyForDev | ||
i915 platform: | GLK | i915 features: | display/Other, GEM/Other |
Description
Felix von Leitner
2018-09-30 17:37:08 UTC
Reporter, Dmesg log from boot with kernel parameter drm.debug=0xf is needed as said in the log.

> Lakshmi changed bug 108103
> ┌────────┬─────────┬──────────┐
> │ What   │ Removed │ Added    │
> ├────────┼─────────┼──────────┤
> │ Status │ NEW     │ NEEDINFO │
> └────────┴─────────┴──────────┘
> Comment # 1 on bug 108103 from Lakshmi
> Reporter, Dmesg log from boot with kernel parameter drm.debug=0xf is needed as
> said in the log.
That WAS my dmesg log from a boot with that kernel parameter.
Sorry if that was not obvious.
BTW: It was kernel 4.19.0-rc5.
Felix

Hi,
Where in this there is failure to load fw?
I only see:
[drm:intel_csr_ucode_init] Loading i915/glk_dmc_ver1_04.bin
[drm] Finished loading DMC firmware i915/glk_dmc_ver1_04.bin (v1.4)
And comparing to our CKL's on CI (https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4905/fi-glk-j4005/boot0.log)
<7>[ 5.604667] [drm:intel_csr_ucode_init [i915]] Loading i915/glk_dmc_ver1_04.bin
<6>[ 5.613131] [drm] Finished loading DMC firmware i915/glk_dmc_ver1_04.bin (v1.4)
Looks quite similar.

(In reply to Jani Saarinen from comment #3)
> Hi,
> Where in this there is failure to load fw?
> I only see:
> [drm:intel_csr_ucode_init] Loading i915/glk_dmc_ver1_04.bin
> [drm] Finished loading DMC firmware i915/glk_dmc_ver1_04.bin (v1.4)
>
> And comparing to our CKL's on CI
I mean GLK's (sorry typo)
> (https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4905/fi-glk-j4005/boot0.log)
>
> <7>[ 5.604667] [drm:intel_csr_ucode_init [i915]] Loading i915/glk_dmc_ver1_04.bin
> <6>[ 5.613131] [drm] Finished loading DMC firmware i915/glk_dmc_ver1_04.bin (v1.4)
>
> Looks quite similar.

> That WAS my dmesg log from a boot with that kernel parameter.
> Sorry if that was not obvious.
Felix, this is not full log (from boot). Full logs will give information from 0 seconds. Attach the whole log which contains the timestamps.
> > That WAS my dmesg log from a boot with that kernel parameter.
> > Sorry if that was not obvious.
> Felix, this is not full log (from boot). Full logs will give information from 0
> seconds. Attach the whole log which contains the timestamps.
My kernel is compiled without timestamps.
This was an egrep 'drm|i915' from the dmesg.
Surely you don't want me to send you the USB device detection messages...?!
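Editorial note on the exchange above: the maintainers ask for the whole log because a filtered excerpt drops the kernel version banner and the command line they triage from. A minimal sketch of a check for that, assuming the log was saved to a file first; the function name `is_full_boot_log` is hypothetical and does not come from this thread. It matches on banner text instead of timestamps, so it also works for kernels built without timestamps.

```shell
# Hypothetical helper (not from the thread): succeed only if a saved kernel
# log still contains the lines printed at the very start of boot.
is_full_boot_log() {
    log_file=$1
    # The version banner and the command line are both logged early in boot;
    # a log that was grepped for 'drm|i915' will have lost them.
    grep -q 'Linux version' "$log_file" &&
        grep -q 'Kernel command line:' "$log_file"
}
```

One way to use it is to save the log with `dmesg > dmesg.txt` and run `is_full_boot_log dmesg.txt` before attaching the file; a non-zero exit status suggests the log was filtered or the ring buffer has already wrapped.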
It turns out that the issues go away if I disable the IOMMU.
That is appalling. As I'm sure you realize the IOMMU is a security mechanism to have a fighting chance against malicious firmware in periphery devices, or in this case, against rogue shaders on the GPU.
How is it possible that Intel makes the GPU, the IOMMU, the driver for the GPU and the IOMMU, and it still does not work together?
Note that I have to disable the IOMMU system wide for the GPU to not "hang".

Joonas, any advice here?

Hi,
What FW fails to load, why this bug exists? Is it FW load issue or GPU hang?
Where we can see boot parameters so please do not grep but send all, thanks.

The kernel command line was: drm.debug=0xf like the documentation said. And the rootfs UUID.
Now there is one more: intel_iommu=off. That made the GPU work.
Note that I also have intermittent screen corruption. The upper right part of the screen flashes like someone is writing binary data into the framebuffer. It then disappears again quickly. Maybe this symptom is related?
The upper left part is an xterm. The corruption is only in the right part, over the background. I'm using fvwm as window manager, no compositor or anything fancy.
I'm guessing my hardware is "too new".

Hi,
can you send whole dmesg, to see all, like in our CI:
<5>[ 0.000000] Linux version 4.19.0-rc7-CI-CI_DRM_4955+ (cidrm@ci-worker1.fi.intel.com) (gcc version 7.3.0 (Ubuntu 7.3.0-27ubuntu1~18.04)) #1 SMP PREEMPT Tue Oct 9 19:43:50 EEST 2018
What kernel you use?

Created attachment 141979 [details]
dmesg
This is a new dmesg output.

(In reply to Felix von Leitner from comment #12)
> Created attachment 141979 [details]
> dmesg
>
> This is a new dmesg output.
Felix, trying to understand the problem here. Do we have two issues atm, 1) GPU hang and 2) Screen corruption including Flickering?
From the log attached I don't see GPU hang. Also, when GPU hang occurs we need crash dump file to investigate the issue.
Secondly, do both issues disappear when IOMMU is disabled?

I have two issues, correct.
Issue 1 is screen corruption at the very top.
Issue 2 is that dmesg said there was a problem and I should open a bug. So I did.
Issue 2 goes away when I build a kernel with no IOMMU support or tell the kernel to disable it via command line.

Removing the "ReadyForDev" label until this is clearer.

I have a similar problem: black screen with j4105 cpu, which reproduces only with iommu enabled.
My ASRock-J4105-ITX board during boot is identified as
DMI: To Be Filled By O.E.M. To Be Filled By O.E.M./J4105-ITX, BIOS P1.40 08/06/2018
Gpu in lspci is identified as
00:02.0 VGA compatible controller: Intel Corporation Device 3185
If I enable VT-d in the EFI Setup and pass the intel_iommu=on parameter, then the monitor becomes black during X start (though there is no gpu hang message, and no screen corruption).
During boot the framebuffer console initializes, but contains an error message (on 4.19.5 debian kernel; on previous kernels the message said something about GPU hang):
i915 0000:00:02.0: Failed to initialize GPU, declaring it wedged!
There is no error state:
# cat /sys/class/drm/card0/error
No error state collected
Then boot continues using framebuffer console, and at the moment lightdm starts the screen becomes just black (ssh still works fine).
Disabling intel_iommu completely solves the problem.

Created attachment 142787 [details]
j4105 with 4.19.5 - full dmesg from boot via journalctl
Created attachment 142788 [details]
j4105 with 4.19.5 - latest messages still in dmesg buffer
Created attachment 142789 [details]
j4105 with 4.19.5 lspci -vvv output
With drm.debug=0xf dmesg output was cut, so attaching both it and extended version from journalctl.
Also attaching lspci -vvv output.
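Editorial note: the report above comes down to two observations, namely the `intel_iommu=` value on the kernel command line and whether the log contains the i915 "wedged" message. A minimal sketch of how both could be read mechanically, assuming the command line is passed as a string and the log was saved to a file; the function names `intel_iommu_setting` and `log_reports_wedged` are hypothetical and not part of any tool mentioned in this bug.

```shell
# Hypothetical helpers (names are illustrative, not from the thread).

# Print the value of the last intel_iommu= parameter found in a kernel
# command line string, or "unset" if the parameter is absent.
intel_iommu_setting() {
    value=unset
    # Unquoted on purpose: split the command line into whitespace-separated words.
    for arg in $1; do
        case $arg in
            intel_iommu=*) value=${arg#intel_iommu=} ;;
        esac
    done
    printf '%s\n' "$value"
}

# Succeed if a saved kernel log contains the i915 "wedged" message
# quoted in this report.
log_reports_wedged() {
    grep -q 'Failed to initialize GPU, declaring it wedged' "$1"
}
```

On a running system the first helper could be called as `intel_iommu_setting "$(cat /proc/cmdline)"`, and the second on a file produced with `dmesg > dmesg.txt`.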

Felix/Vasily
How often do you see this issue, every time after reboot? Have you tried using Kernel 4.20?
Chris, any comments?

Created attachment 142841 [details]
142788: j4105 with 4.20 - drm tip full dmesg from boot via journalctl
The problem reproduces on every boot. It always hangs while iommu is enabled.
Sometimes - with other kernel or after soft reboot - the error messages are a bit different, but I'm not sure - I lost them after reboot.
Nothing changed after updating from 4.19 to more recent kernel - ubuntu's build of drm-tip.
The exact version of the new kernel is:
Linux version 4.20.0-994-lowlatency (kernel@gloin) (gcc version 8.2.0 (Ubuntu 8.2.0-12ubuntu1)) #201812142101 SMP PREEMPT Sat Dec 15 02:06:59 UTC 2018
https://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/2018-12-15/
These binary packages represent builds of the mainline or stable Linux kernel tree at the commit below:
cod/tip/drm-tip/2018-12-15 (2abfab12278273a26679335d0c65980816c42206)

I just tried it with Linux 4.20, and I get working X11 even with IOMMU.
Great work, thanks!
There is still slight screen corruption in the topmost few lines of the screen, but I can live with that.

I also tried with vanilla 4.20 build with ubuntu config 'generic' from https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20/
Unfortunately the problem on my j4105 CPU is not fixed: if both VT-d is enabled in bios and intel_iommu=on is passed to kernel - during boot it outputs
> i915 0000:00:02.0: Failed to initialize GPU, declaring it wedged!
and then hangs during X11 start.
The problem stays the same even if I plug in an external PCIe gpu and connect the monitor to it - while everything (uefi, grub, framebuffer, x11 when i915.modeset=0) is displayed via the external gpu, just the "initialization" of the i915 driver without any outputs connected leads to this problem (however, since the motherboard has a d-sub connector, it maybe has some internal hdmi-to-d-sub adapter).
With PCIe gpu I also tried with VT-d on, intel_iommu=on and i915.modeset=0 - this works fine (displaying X11 on the PCIe gpu with amdgpu driver)
Here is the output of IOMMU groups:
> for d in /sys/kernel/iommu_groups/*/devices/*; do n=${d#*/iommu_groups/*}; n=${n%%/*}; printf 'IOMMU Group %s ' "$n"; lspci -nns "${d##*/}"; done;
>
>IOMMU Group 0 00:00.0 Host bridge [0600]: Intel Corporation Device [8086:31f0] (rev 03)
>IOMMU Group 0 00:00.1 Signal processing controller [1180]: Intel Corporation Device [8086:318c] (rev 03)
>IOMMU Group 10 00:1f.0 ISA bridge [0601]: Intel Corporation Device [8086:31e8] (rev 03)
>IOMMU Group 10 00:1f.1 SMBus [0c05]: Intel Corporation Device [8086:31d4] (rev 03)
>IOMMU Group 11 01:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Tobago PRO [Radeon R7 360 / R9 360 OEM] [1002:665f] (rev 81)
>IOMMU Group 11 01:00.1 Audio device [0403]: Advanced Micro Devices, Inc. [AMD/ATI] Tobago HDMI Audio [Radeon R7 360 / R9 360 OEM] [1002:aac0]
>IOMMU Group 12 03:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 15)
>IOMMU Group 13 04:00.0 SATA controller [0106]: ASMedia Technology Inc. ASM1062 Serial ATA Controller [1b21:0612] (rev 02)
>IOMMU Group 1 00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:3185] (rev 03)
>IOMMU Group 2 00:0e.0 Audio device [0403]: Intel Corporation Device [8086:3198] (rev 03)
>IOMMU Group 3 00:0f.0 Communication controller [0780]: Intel Corporation Device [8086:319a] (rev 03)
>IOMMU Group 4 00:12.0 SATA controller [0106]: Intel Corporation Device [8086:31e3] (rev 03)
>IOMMU Group 5 00:13.0 PCI bridge [0604]: Intel Corporation Device [8086:31d8] (rev f3)
>IOMMU Group 6 00:13.1 PCI bridge [0604]: Intel Corporation Device [8086:31d9] (rev f3)
>IOMMU Group 7 00:13.2 PCI bridge [0604]: Intel Corporation Device [8086:31da] (rev f3)
>IOMMU Group 8 00:13.3 PCI bridge [0604]: Intel Corporation Device [8086:31db] (rev f3)
>IOMMU Group 9 00:15.0 USB controller [0c03]: Intel Corporation Device [8086:31a8] (rev 03)
So the 00:02.0 internal gpu is the sole device in its iommu group.

So far we were unable to reproduce the issue with kernel v5.0.0-rc1.
Vasily Galkin could you verify whether the above kernel version fixes the issue for you?

I've tested on Archlinux with linux-mainline aur kernel (vanilla sources with arch config afaik).
The 4.20rc6-1 still had this problem, and updating to 5.1rc1-1 solved it.
So confirming that the bug is fixed at least in 5.1rc1-1. I didn't test 5.0 kernels.
Now X11 and 3d work fine even with intel_iommu=on in kernel cmdline and in bios. (I checked that iommu groups are actually created).
So, marking as fixed.

Thanks for verifying!