The new Coffee Lake board DTCL2KEBQ contains the new i915 PCI ID 8086:3e98, support for which was recently added by commit d0e062ebb3a4. The problem is that loading the i915 kernel module on this machine breaks the NVMe PCI device all of a sudden. The symptom was first seen on a SUSE Linux Enterprise 15 update kernel containing the patch above, and then confirmed on the latest linux-next.

The graphics device itself seems to work after loading i915; the frame buffer switches to the native resolution and the boot proceeds. However, it triggers DPC AER messages from the pcieport, and NVMe gets broken:

[    4.881913] dpc 0000:00:1b.0:pcie010: DPC containment event, status:0x1f00 source:0x0000
[    4.881987] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
[    4.882063] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
[    4.882136] pcieport 0000:00:1b.0: device [8086:a340] error status/mask=00000001/00002000
[    4.882196] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
[    4.921257] systemd: 25 output lines suppressed due to ratelimiting
[    5.162150] dpc 0000:00:1b.0:pcie010: DPC containment event, status:0x1f00 source:0x0000
[    5.162230] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
[    5.162289] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
[    5.162355] pcieport 0000:00:1b.0: device [8086:a340] error status/mask=00000001/00002000
[    5.162412] pcieport 0000:00:1b.0: [ 0] Receiver Error (First)
....
[   39.804133] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[   39.928128] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[   39.928732] nvme nvme0: Removing after probe failure status: -19

It turned out that the PCI device entries of other slots are changed just by loading the i915 driver. Attached below is the lspci -vvv output for the good (no i915) and broken cases. For example, you can see the difference in LTR. In the good case we have:

	Capabilities: [2d0 v1] Latency Tolerance Reporting
		Max snoop latency: 3145728ns
		Max no snoop latency: 3145728ns

and in the broken case:

	Capabilities: [2d0 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns

The PCI DPC/AER messages can be suppressed by passing the pcie_aspm=off boot option (which could be used as a workaround on AMD ThreadRipper), but this didn't "fix" the actual NVMe error itself. I also tried intel_iommu=igfx_off and pci=nommconf, but in vain.

The problem isn't about the boot sequence, either: booting with the nomodeset option, letting the system start up, and then re-loading i915 with modeset=1 screws up NVMe just the same. So it's really the i915 driver that triggers the breakage.
Created attachment 142188 [details] lspci -vvv output for good-working configuration
Created attachment 142189 [details] lspci -vvv output for broken configuration
Rodrigo, any comments here?
Totally no idea if this is related or not, but the last time I hit an nvme vs. i915 issue was http://mid.mail-archive.com/87shaveb5b.fsf@intel.com
Hi Takashi,

Could you please try to disable C-states, DC states, and FBC?

intel_idle.max_cstate=1 i915.enable_dc=0 i915.enable_fbc=0

Also, could you please provide a full dmesg booting with drm.debug=0xe?

Thanks in advance,
Rodrigo.
(In reply to Rodrigo Vivi from comment #5)
> intel_idle.max_cstate=1 i915.enable_dc=0 i915.enable_fbc=0

Didn't help, unfortunately. I've attached the dmesg output (but without drm.debug) with these options below.

> Also, could you please provide a full dmesg booting with drm.debug=0xe?

Attached below, too. This one is without the extra options above.
Created attachment 142258 [details] Kernel booted with intel_idle.max_cstate=1 and other options
Created attachment 142259 [details] Kernel booted with drm.debug=0x0e option
I checked repeatedly and confirmed that the pcieport error is always reported right after intel_hdmi_detect(). It happens even when booting with nomodeset and then reloading i915 with modeset=1, so it's not about the boot timing.
So I tried to hack around the function, just like:

--- a/drivers/gpu/drm/i915/intel_hdmi.c
+++ b/drivers/gpu/drm/i915/intel_hdmi.c
@@ -1921,7 +1921,7 @@ intel_hdmi_detect(struct drm_connector *connector, bool force)
 
 	intel_display_power_get(dev_priv, POWER_DOMAIN_GMBUS);
 
-	if (IS_ICELAKE(dev_priv) &&
+	if (//IS_ICELAKE(dev_priv) &&
 	    !intel_digital_port_connected(encoder))
 		goto out;

... and this seems to work. No NVMe-related errors are seen after this.

I don't mean that this is the right fix, but it indicates that poking the HDMI detection for HDMI-3 screws up the NVMe on PCIe, apparently.
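For reference, here is roughly the code surrounding the hacked hunk (a simplified sketch of the v4.20-era intel_hdmi_detect(); locking, debug output, and the CEC notifier are elided, so don't take it as the verbatim upstream code). With the IS_ICELAKE() check commented out, the live-status test applies on all platforms, so a disconnected port bails out before the EDID probe ever touches GMBUS:

/* Simplified sketch; not the verbatim upstream code. */
static enum drm_connector_status
intel_hdmi_detect(struct drm_connector *connector, bool force)
{
	enum drm_connector_status status = connector_status_disconnected;
	struct drm_i915_private *dev_priv = to_i915(connector->dev);
	struct intel_hdmi *intel_hdmi = intel_attached_hdmi(connector);
	struct intel_encoder *encoder = &hdmi_to_dig_port(intel_hdmi)->base;

	intel_display_power_get(dev_priv, POWER_DOMAIN_GMBUS);

	/* Upstream trusts the live-status check only on Icelake; the
	 * hack extends it to every platform, so the phantom HDMI-3
	 * port takes the goto here ... */
	if (IS_ICELAKE(dev_priv) &&
	    !intel_digital_port_connected(encoder))
		goto out;

	/* ... and never reaches the EDID read, which would go out over
	 * the GMBUS/DDC pins that appear to be misrouted on this board. */
	intel_hdmi_unset_edid(connector);
	if (intel_hdmi_set_edid(connector))
		status = connector_status_connected;

out:
	intel_display_power_put(dev_priv, POWER_DOMAIN_GMBUS);
	return status;
}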
The skip in intel_hdmi_detect() works around the problem on the SLE15 kernel (which contains lots of i915 backports), too, so this is definitely a key.

One thing I noticed is that the machine detects the HDMI output as DP-1. And the actual DP port doesn't seem to work at all; even the BIOS screen doesn't appear on this output. I'm not sure whether it's relevant, but JFYI.
Created attachment 142333 [details] [review] I915-DEBUG printks

Hi Takashi, sorry for the delay here.

Could you please provide the log for the attached patch? I'd like to understand the flow our code is taking in your particular case.

Thanks, Rodrigo
Also, this series just got merged:

https://patchwork.freedesktop.org/series/51765/
drm/i915/icl: Fix HDMI on TypeC static ports (rev4)

Could you please check if that helps somehow?

Thanks, Rodrigo.
(In reply to Rodrigo Vivi from comment #12)
> Could you please provide the log for the attached patch?
>
> I'd like to understand the flow our code is taking in your
> particular case.

This won't give any output (actually confirmed); the target system is Coffee Lake, not Ice Lake, so the patched code path isn't touched at all.
(In reply to Rodrigo Vivi from comment #13)
> Also, this series just got merged:
>
> https://patchwork.freedesktop.org/series/51765/
> drm/i915/icl: Fix HDMI on TypeC static ports (rev4)
>
> Could you please check if that helps somehow?

Can we have a patchset that applies cleanly on top of the Linux git tree?
BTW, another workaround for this issue is to force LSPCON detection:

--- a/drivers/gpu/drm/i915/intel_bios.c
+++ b/drivers/gpu/drm/i915/intel_bios.c
@@ -2120,6 +2120,7 @@ intel_bios_is_lspcon_present(struct drm_i915_private *dev_priv,
 	if (!HAS_LSPCON(dev_priv))
 		return false;
 
+	return true;
 	for (i = 0; i < dev_priv->vbt.child_dev_num; i++) {
 		child = dev_priv->vbt.child_dev + i;

Then the NVMe and AER errors are gone, and, on top of that, even the dead DP output starts working!
(In reply to Takashi Iwai from comment #16)
> BTW, another workaround for this issue is to force LSPCON detection:
...
> Then the NVMe and AER errors are gone, and, on top of that, even the dead
> DP output starts working!

So this looks more like a BIOS bug to me. But it would still be helpful to have some workaround that doesn't require patching the code in this ugly way...
Oops, sorry about the confusion with ICL vs. CFL... I was looking at other ICL bugs and got confused...

It seems that the VBT is the problem here. Could you please attach /sys/kernel/debug/dri/0/i915_vbt here?

Also a quick check that I'm particularly curious about:

$ sudo ~/igt/build/tools/intel_vbt_decode /sys/kernel/debug/dri/0/i915_vbt | grep -i lspcon
Created attachment 142371 [details] /sys/kernel/debug/dri/0/i915_vbt content
Created attachment 142372 [details] intel_vbt_decode output
% grep -i lspcon vbt-decode
	Onboard LSPCON: no
	Onboard LSPCON: no
	Onboard LSPCON: no
	Onboard LSPCON: no
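This matches the code: intel_bios_is_lspcon_present(), the function hacked in comment #16, only reports an LSPCON when some VBT child device carries the lspcon flag, and per the decode above no child device does. Roughly (a paraphrase of the v4.20-era function; the per-port matching in the loop body is elided):

bool
intel_bios_is_lspcon_present(struct drm_i915_private *dev_priv,
			     enum port port)
{
	const struct child_device_config *child;
	int i;

	if (!HAS_LSPCON(dev_priv))
		return false;

	/* Without the forced "return true;" hack, this loop is the
	 * only way to report an LSPCON, and it requires the VBT to
	 * set the lspcon flag on a child device for this port. */
	for (i = 0; i < dev_priv->vbt.child_dev_num; i++) {
		child = dev_priv->vbt.child_dev + i;
		if (!child->lspcon)
			continue;
		/* ... match child->dvo_port against the requested
		 * port and return true on a hit (details elided) ... */
	}

	return false;
}

So if the board really does have an onboard LSPCON, the VBT is simply not advertising it.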
(In reply to Takashi Iwai from comment #10)
> So I tried to hack around the function, just like:
>
> --- a/drivers/gpu/drm/i915/intel_hdmi.c
> +++ b/drivers/gpu/drm/i915/intel_hdmi.c
> @@ -1921,7 +1921,7 @@ intel_hdmi_detect(struct drm_connector *connector,
> bool force)
>
>  	intel_display_power_get(dev_priv, POWER_DOMAIN_GMBUS);
>
> -	if (IS_ICELAKE(dev_priv) &&
> +	if (//IS_ICELAKE(dev_priv) &&
>  	    !intel_digital_port_connected(encoder))
>  		goto out;
>
> ... and this seems to work. No NVMe-related errors are seen after this.
>
> I don't mean that this is the right fix, but it indicates that poking the
> HDMI detection for HDMI-3 screws up the NVMe on PCIe, apparently.

Hmm. I wonder if the gmbus pins are wired up for some other use, but somehow they are still muxed such that gmbus can control them. That definitely sounds like a BIOS bug, or potentially a pinctrl driver bug.
Interesting point. This is a Cannonlake PCH, so the table the driver follows for the pins is:

+----------+-----------+--------------------+
| DDI Type | VBT Value | Bspec Mapped Value |
+----------+-----------+--------------------+
| N/A      | 0x0       | ---                |
| DDI-B    | 0x1       | 0x1                |
| DDI-C    | 0x2       | 0x2                |
| DDI-D    | 0x3       | 0x4                |
| DDI-F    | 0x4       | 0x3                |
+----------+-----------+--------------------+

VBT seems to follow the same numbers there, but maybe someone didn't follow them properly somewhere else. To test this possibility we would need to play with the cnp_ddc_pin_map table, or maybe use the old direct map (as on SKL) in map_ddc_pin().

-----

But right now what I suspect is that we do have an LSPCON on this product and the VBT is simply lying, because returning true for LSPCON presence not only fixes NVMe, it also fixes the port identification and gets the output working.

I believe we should find a way to get the right information about this board and about the VBT in question. Takashi, I'm assuming this is not an RVP, right? Could you please contact the OEM in question to get more information about the design, and get us, in private, the contact of the Intel FEA involved with this product?

Thanks, Rodrigo.
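For reference, the mapping in question lives in map_ddc_pin() in intel_bios.c and looks roughly like this (a sketch based on the v4.20-era code; the out-of-range error path is simplified). Testing the wrong-pin theory would mean swapping entries in this table, or forcing the direct-map path that pre-CNP platforms use:

/* Sketch; the indices are the VBT values and the entries the
 * Bspec/GMBUS pins from the table above. */
static const u8 cnp_ddc_pin_map[] = {
	[0] = 0,                            /* N/A */
	[DDC_BUS_DDI_B] = GMBUS_PIN_1_BXT,  /* 0x1 -> 0x1 */
	[DDC_BUS_DDI_C] = GMBUS_PIN_2_BXT,  /* 0x2 -> 0x2 */
	[DDC_BUS_DDI_D] = GMBUS_PIN_4_CNP,  /* 0x3 -> 0x4 */
	[DDC_BUS_DDI_F] = GMBUS_PIN_3_BXT,  /* 0x4 -> 0x3 */
};

static u8 map_ddc_pin(struct drm_i915_private *dev_priv, u8 vbt_pin)
{
	if (HAS_PCH_CNP(dev_priv)) {
		if (vbt_pin < ARRAY_SIZE(cnp_ddc_pin_map))
			return cnp_ddc_pin_map[vbt_pin];
		return 0;	/* ignore an out-of-range VBT pin */
	}

	/* Older platforms (e.g. SKL) take the VBT value as-is;
	 * forcing this path is one way to test the theory. */
	return vbt_pin;
}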
Takashi, any updates here?
The information was already given to Rodrigo via the SUSE Bugzilla. Rather, we've been waiting for an update from the Intel side...
I'm really sorry for the delay here.

I've filed an internal bug with the VBIOS team. I'll keep this and/or the SUSE Bugzilla updated.
(In reply to Rodrigo Vivi from comment #26)
> I'm really sorry for the delay here.
>
> I've filed an internal bug with the VBIOS team. I'll keep this and/or the
> SUSE Bugzilla updated.

Rodrigo, any update here?
It is a BIOS bug, and the customer issue has been closed, so let's reflect that here.