Summary: | Loading i915 kernel module breaks NVMe PCI device on the new Coffee Lake box | ||
---|---|---|---|
Product: | DRI | Reporter: | Takashi Iwai <tiwai> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | RESOLVED NOTOURBUG | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs, james.ausmus, rodrigo.vivi |
Version: | XOrg git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | All | ||
Whiteboard: | Triaged | ||
i915 platform: | CFL | i915 features: | |
Attachments: |
Description
Takashi Iwai
2018-10-25 06:44:04 UTC
Created attachment 142188 [details]
lspci -vvv output for good-working configuration
Created attachment 142189 [details]
lspci -vvv output for broken configuration
Rodrigo, any comments here?

Totally no idea if this is related or not, but the last time I hit an NVMe vs. i915 issue was
http://mid.mail-archive.com/87shaveb5b.fsf@intel.com

Hi Takashi,

Could you please try to disable PC state, DC states, and FBC?

intel_idle.max_cstate=1 i915.enable_dc=0 i915.enable_fbc=0

Also, could you please provide the full dmesg from booting with drm.debug=0xe?

Thanks in advance,
Rodrigo.

(In reply to Rodrigo Vivi from comment #5)
> intel_idle.max_cstate=1 i915.enable_dc=0 i915.enable_fbc=0

Didn't help, unfortunately. I attach the dmesg output (but no drm.debug) with these options below.

> Also, could you please provide full dmesg booting with drm.debug=0xe ?

Attached below, too. This one is without the extra options above.

Created attachment 142258 [details]
Kernel booted with intel_idle.max_cstate=1 and other options
Created attachment 142259 [details]
Kernel booted with drm.debug=0x0e option
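For readers wondering what the requested drm.debug=0xe value turns on, here is a minimal sketch that decodes the bitmask. The bit names follow the debug categories in the kernel's include/drm/drm_print.h as I understand them for kernels of this era; the helper `decode_drm_debug` and the table name are hypothetical and exist only for illustration.

```python
# Debug category bits for the drm.debug module parameter.
# Assumption: values as listed in include/drm/drm_print.h (4.19-era kernels).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
}


def decode_drm_debug(mask):
    """Return the names of the categories enabled by `mask`, lowest bit first."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


# The value requested in this thread.
print(decode_drm_debug(0xE))
```

Under these assumptions, 0xe enables the driver, KMS, and PRIME messages but leaves the noisier core category off.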
I checked repeatedly and confirmed that the pcieport error is always reported right after intel_hdmi_detect. It happens even when booting with nomodeset and then reloading i915 with modeset=1, so it's not about the boot timing.

So I tried to hack around the function, just like

--- a/drivers/gpu/drm/i915/intel_hdmi.c
+++ b/drivers/gpu/drm/i915/intel_hdmi.c
@@ -1921,7 +1921,7 @@ intel_hdmi_detect(struct drm_connector *connector, bool force)
 
 	intel_display_power_get(dev_priv, POWER_DOMAIN_GMBUS);
 
-	if (IS_ICELAKE(dev_priv) &&
+	if (//IS_ICELAKE(dev_priv) &&
 	    !intel_digital_port_connected(encoder))
 		goto out;
 

... and this seems to work. No NVMe-related errors are seen after this.

I don't mean that this is the right fix, but it indicates that poking the HDMI detection for HDMI-3 screws up the NVMe on PCIe, apparently.

Skipping intel_hdmi_detect() works around the problem on the SLE15 kernel (which contains lots of i915 backports), too, so this is definitely a key.

One thing I noticed is that the machine detects the HDMI output as DP-1. And the actual DP port doesn't seem to work at all. Even the BIOS screen doesn't appear on this output. I'm not sure whether it's relevant, but JFYI.

Created attachment 142333 [details] [review]
I915-DEBUG printks

Hi Takashi, sorry for the delay here.

Could you please provide the log for the attached patch?

I'd like to understand the flow our code is taking for your particular case.

Thanks,
Rodrigo

Also, this series just got merged:

https://patchwork.freedesktop.org/series/51765/

drm/i915/icl: Fix HDMI on TypeC static ports (rev4)

Could you please try to check if that helps somehow?

Thanks,
Rodrigo.

(In reply to Rodrigo Vivi from comment #12)
> Could you please provide the log for the attached patch?
>
> I'd like to understand the flow our code is taking for your
> particular case.

This won't give any output (actually confirmed); the target system is Coffee Lake, not Ice Lake, thus the patched code path isn't touched at all.

(In reply to Rodrigo Vivi from comment #13)
> Also, this series just got merged:
>
> https://patchwork.freedesktop.org/series/51765/
>
> drm/i915/icl: Fix HDMI on TypeC static ports (rev4)
>
> Could you please try to check if that helps somehow?

Can we have a patchset that is cleanly applicable on top of Linux git tree?

BTW, another workaround for this issue is to enforce LSPCON:

--- a/drivers/gpu/drm/i915/intel_bios.c
+++ b/drivers/gpu/drm/i915/intel_bios.c
@@ -2120,6 +2120,7 @@ intel_bios_is_lspcon_present(struct drm_i915_private *dev_priv,
 	if (!HAS_LSPCON(dev_priv))
 		return false;
 
+	return true;
 	for (i = 0; i < dev_priv->vbt.child_dev_num; i++) {
 		child = dev_priv->vbt.child_dev + i;
 

Then the NVMe and AER errors are gone, plus, even the dead DP starts working!

(In reply to Takashi Iwai from comment #16)
> BTW, another workaround for this issue is to enforce LSPCON:
...
> Then the NVMe and AER errors are gone, plus, even the dead DP becomes
> working!

So this looks more like a BIOS bug to me. But it would still be helpful to have some workaround that doesn't require patching the code in this ugly way...

Oops, sorry about the confusion with ICL vs. CFL... I was looking at other ICL bugs and got confused...

It seems that VBT is the problem here.

Could you please attach /sys/kernel/debug/dri/0/i915_vbt here?

Also a quick check that I'm particularly curious about:

$ sudo ~/igt/build/tools/intel_vbt_decode /sys/kernel/debug/dri/0/i915_vbt | grep -i lspcon

Created attachment 142371 [details]
/sys/kernel/debug/dri/0/i915_vbt content
Created attachment 142372 [details]
intel_vbt_decode output
% grep -i lspcon vbt-decode
Onboard LSPCON: no
Onboard LSPCON: no
Onboard LSPCON: no
Onboard LSPCON: no

(In reply to Takashi Iwai from comment #10)
> So I tried to hack around the function, just like
>
> --- a/drivers/gpu/drm/i915/intel_hdmi.c
> +++ b/drivers/gpu/drm/i915/intel_hdmi.c
> @@ -1921,7 +1921,7 @@ intel_hdmi_detect(struct drm_connector *connector,
> bool force)
>
>  	intel_display_power_get(dev_priv, POWER_DOMAIN_GMBUS);
>
> -	if (IS_ICELAKE(dev_priv) &&
> +	if (//IS_ICELAKE(dev_priv) &&
>  	    !intel_digital_port_connected(encoder))
>  		goto out;
>
>
> ... and this seems working. No NVMe-related errors are seen after this.
>
> I don't mean that this is the right fix, but it indicates that poking the
> HDMI detection for HDMI-3 screws up the NVMe on PCIe, apparently.

Hmm. I wonder if the gmbus pins are wired up to some other use, but somehow they are still muxed such that gmbus can control them. That definitely sounds like a BIOS bug, or potentially a pinctrl driver bug.

Interesting point.

This is a Cannonlake PCH, so the table the driver follows for the pins is:

+----------+-----------+--------------------+
| DDI Type | VBT Value | Bspec Mapped Value |
+----------+-----------+--------------------+
| N/A      | 0x0       | ---                |
| DDI-B    | 0x1       | 0x1                |
| DDI-C    | 0x2       | 0x2                |
| DDI-D    | 0x3       | 0x4                |
| DDI-F    | 0x4       | 0x3                |
+----------+-----------+--------------------+

VBT seems to follow the same numbers there. But maybe someone didn't follow it properly somewhere else.

To test this possibility we would need to play with the cnp_ddc_pin_map table, or maybe use the old direct map, like skl, in map_ddc_pin().

-----

But right now what I suspect is that we have LSPCON on this product and the VBT is simply lying.

Because returning true for LSPCON presence not only fixes NVMe but also fixes port identification and gets the port working, I believe we should find a way to get the right information about this board and about the VBT in question.

Takashi, I'm assuming this is not an RVP, right?

Could you please contact the OEM in question to get more information about the design? And could you send us, in private, the contact of the Intel FEA involved with this product?

Thanks,
Rodrigo.

Takashi, any updates here?

The information was already given to Rodrigo via SUSE bugzilla. Rather, we've been waiting for an update from the Intel side...

I'm really sorry for the delay here.

I've filed an internal bug with the VBIOS team. I will keep this and/or the SUSE Bugzilla updated.

(In reply to Rodrigo Vivi from comment #26)
> I'm really sorry for the delay here.
>
> I've filled internal bug to VBIOS teams. I will keep this and/or the
> SUSE Bugzilla updated.

Rodrigo, any update here?

It is a BIOS bug and the customer issue has been closed, so let's reflect it here.
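
For reference, the DDC pin translation discussed in this thread can be modeled in a few lines. This is a minimal sketch, not the kernel's code: the table values are copied from the Cannonlake PCH table quoted above, the names `CNP_DDC_PIN_MAP` and `map_ddc_pin` only echo the identifiers mentioned in the discussion, the `use_cnp_table` flag is hypothetical, and returning 0 for an unknown VBT value is an assumption.

```python
# VBT DDC pin value -> Bspec mapped value for a Cannonlake PCH,
# taken from the table quoted in this thread. 0x0 (N/A) has no mapping.
CNP_DDC_PIN_MAP = {
    0x1: 0x1,  # DDI-B
    0x2: 0x2,  # DDI-C
    0x3: 0x4,  # DDI-D
    0x4: 0x3,  # DDI-F
}


def map_ddc_pin(vbt_pin, use_cnp_table=True):
    """Translate a VBT DDC pin value into the pin value the driver would use.

    use_cnp_table=False models the "old direct map" idea from the thread,
    where the VBT value is passed through unchanged.
    Assumption: values missing from the table map to 0 (no pin).
    """
    if not use_cnp_table:
        return vbt_pin
    return CNP_DDC_PIN_MAP.get(vbt_pin, 0)


# Show where the two schemes disagree.
for vbt_pin in sorted(CNP_DDC_PIN_MAP):
    table = map_ddc_pin(vbt_pin)
    direct = map_ddc_pin(vbt_pin, use_cnp_table=False)
    note = "differs" if table != direct else "same"
    print(f"VBT 0x{vbt_pin:x}: table 0x{table:x}, direct 0x{direct:x} ({note})")
```

Only DDI-D and DDI-F differ between the two schemes, so a board whose VBT does not follow the table would end up probing the wrong pins on exactly those ports.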