Bug 108546 - Loading i915 kernel module breaks NVMe PCI device on the new Coffee Lake box
Status: RESOLVED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: x86-64 (AMD64) All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-25 06:44 UTC by Takashi Iwai
Modified: 2019-04-09 15:46 UTC (History)
CC List: 3 users

See Also:
i915 platform: CFL
i915 features:


Attachments
lspci -vvv output for good-working configuration (24.04 KB, text/plain)
2018-10-25 06:45 UTC, Takashi Iwai
lspci -vvv output for broken configuration (24.02 KB, text/plain)
2018-10-25 06:46 UTC, Takashi Iwai
Kernel booted with intel_idle.max_cstate=1 and other options (58.39 KB, text/plain)
2018-10-29 16:16 UTC, Takashi Iwai
Kernel booted with drm.debug=0x0e option (132.46 KB, text/plain)
2018-10-29 16:17 UTC, Takashi Iwai
I915-DEBUG printks (4.51 KB, patch)
2018-11-02 00:19 UTC, Rodrigo Vivi
/sys/kernel/debug/dri/0/i915_vbt content (6.00 KB, application/octet-stream)
2018-11-05 14:48 UTC, Takashi Iwai
intel_vbt_decode output (9.81 KB, text/plain)
2018-11-05 14:48 UTC, Takashi Iwai

Description Takashi Iwai 2018-10-25 06:44:04 UTC
The new Coffee Lake board DTCL2KEBQ contains the new i915 PCI ID 8086:3e98, support for which was recently added by commit d0e062ebb3a4.
The problem is that loading the i915 kernel module on this machine suddenly breaks the NVMe PCI device.

The symptom was found at first on SUSE Linux Enterprise 15 update kernel containing the patch above, then confirmed to be present on the latest linux-next.

The graphics device itself seems to work after loading i915: the frame buffer switches to the native resolution and the boot proceeds.  However, it triggers DPC and AER messages from the pcieport, and NVMe gets broken.

[    4.881913] dpc 0000:00:1b.0:pcie010: DPC containment event, status:0x1f00 source:0x0000
[    4.881987] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
[    4.882063] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
[    4.882136] pcieport 0000:00:1b.0:   device [8086:a340] error status/mask=00000001/00002000
[    4.882196] pcieport 0000:00:1b.0:    [ 0] Receiver Error         (First)
[    4.921257] systemd: 25 output lines suppressed due to ratelimiting
[    5.162150] dpc 0000:00:1b.0:pcie010: DPC containment event, status:0x1f00 source:0x0000
[    5.162230] pcieport 0000:00:1b.0: AER: Corrected error received: id=00d8
[    5.162289] pcieport 0000:00:1b.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00d8(Receiver ID)
[    5.162355] pcieport 0000:00:1b.0:   device [8086:a340] error status/mask=00000001/00002000
[    5.162412] pcieport 0000:00:1b.0:    [ 0] Receiver Error         (First)
....
[   39.804133] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[   39.928128] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[   39.928732] nvme nvme0: Removing after probe failure status: -19
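
For readers decoding the "error status/mask=00000001/00002000" lines in the log above, here is a minimal sketch in C of how the AER Correctable Error Status bits map to names. The bit positions follow the PCIe AER capability layout; the names `aer_cor_bits` and `aer_cor_decode` are hypothetical helpers written for this note, not kernel functions.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* One named bit of the PCIe AER Correctable Error Status register. */
struct aer_bit {
	unsigned bit;
	const char *name;
};

/* Illustrative table (hypothetical name, not the kernel's own table). */
static const struct aer_bit aer_cor_bits[] = {
	{ 0,  "Receiver Error" },
	{ 6,  "Bad TLP" },
	{ 7,  "Bad DLLP" },
	{ 8,  "REPLAY_NUM Rollover" },
	{ 12, "Replay Timer Timeout" },
	{ 13, "Advisory Non-Fatal Error" },
	{ 14, "Corrected Internal Error" },
	{ 15, "Header Log Overflow" },
};

/*
 * Write the comma-separated names of all known bits set in `status`
 * into `buf` and return how many known bits were set.
 */
static int aer_cor_decode(uint32_t status, char *buf, size_t len)
{
	size_t i, used;
	int n = 0;

	if (len == 0)
		return 0;
	buf[0] = '\0';

	for (i = 0; i < sizeof(aer_cor_bits) / sizeof(aer_cor_bits[0]); i++) {
		if (!(status & (UINT32_C(1) << aer_cor_bits[i].bit)))
			continue;
		used = strlen(buf);
		snprintf(buf + used, len - used, "%s%s",
			 n ? ", " : "", aer_cor_bits[i].name);
		n++;
	}
	return n;
}
```

Applied to the values above, the status word 0x00000001 decodes to "Receiver Error" (matching the "[ 0] Receiver Error" line), and the mask word 0x00002000 has only the Advisory Non-Fatal Error bit set.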

It turned out that the PCI device entries of other slots change just by loading the i915 driver.  Attached below are the lspci -vvv outputs for the working (no i915) and broken cases.  For example, you can see the difference in LTR:

In the good case we have:
	Capabilities: [2d0 v1] Latency Tolerance Reporting
		Max snoop latency: 3145728ns
		Max no snoop latency: 3145728ns

and in the bad case:
	Capabilities: [2d0 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
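
As background for the LTR values above: lspci derives the nanosecond figure from a 16-bit register in the LTR capability, where bits 9:0 hold a value and bits 12:10 hold a scale, and the latency is value * 2^(5 * scale) ns. The following is a minimal sketch in C of that decoding. The function name `ltr_latency_ns` is hypothetical, and the raw register values in the usage note are assumptions, since the report only shows the decoded nanoseconds.

```c
#include <stdint.h>

/*
 * Decode a 16-bit LTR Max Snoop / Max No-Snoop Latency register into
 * nanoseconds. Scale encodings 6 and 7 are not permitted by the PCIe
 * specification; UINT64_MAX is returned for them.
 */
static uint64_t ltr_latency_ns(uint16_t reg)
{
	uint64_t value = reg & 0x3ff;		/* bits 9:0: latency value */
	unsigned scale = (reg >> 10) & 0x7;	/* bits 12:10: latency scale */

	if (scale > 5)
		return UINT64_MAX;
	return value << (5 * scale);		/* value * 2^(5 * scale) ns */
}
```

For instance, a raw value of 0x1003 (value 3, scale 4) would decode to the 3145728ns seen in the good case, and a register that reads as 0x0000 decodes to the 0ns seen in the bad case. Other encodings, such as 0x0c60, yield the same 3145728ns, so the actual raw value on this board cannot be inferred from the lspci output alone.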


The PCI DPC/AER messages could be suppressed by passing the pcie_aspm=off boot option (which could be used as a workaround for AMD ThreadRipper), but this didn't "fix" the actual NVMe error itself.

Also I tried intel_iommu=igfx_off and pci=nommconf, but in vain.

The problem isn't about the boot sequence either.  When booting with the nomodeset option, letting the system start up, and then reloading i915 with modeset=1, NVMe gets screwed up as above.  So it's actually the i915 driver that triggers the breakage.
Comment 1 Takashi Iwai 2018-10-25 06:45:35 UTC
Created attachment 142188 [details]
lspci -vvv output for good-working configuration
Comment 2 Takashi Iwai 2018-10-25 06:46:20 UTC
Created attachment 142189 [details]
lspci -vvv output for broken configuration
Comment 3 Lakshmi 2018-10-25 11:22:04 UTC
Rodrigo, any comments here?
Comment 4 Jani Nikula 2018-10-25 15:50:57 UTC
I have no idea whether this is related, but the last time I hit an NVMe vs. i915 issue was http://mid.mail-archive.com/87shaveb5b.fsf@intel.com
Comment 5 Rodrigo Vivi 2018-10-26 23:35:19 UTC
Hi Takashi,

Could you please try to disable PC state, DC states, and FBC?

intel_idle.max_cstate=1 i915.enable_dc=0 i915.enable_fbc=0

Also, could you please provide full dmesg booting with drm.debug=0xe ?

Thanks in advance,
Rodrigo.
Comment 6 Takashi Iwai 2018-10-29 16:15:51 UTC
(In reply to Rodrigo Vivi from comment #5)
> intel_idle.max_cstate=1 i915.enable_dc=0 i915.enable_fbc=0

Didn't help, unfortunately.  I attach the dmesg output (but no drm.debug) with these options below.

> Also, could you please provide full dmesg booting with drm.debug=0xe ?

Attached below, too.  This one is without the extra options above.
Comment 7 Takashi Iwai 2018-10-29 16:16:38 UTC
Created attachment 142258 [details]
Kernel booted with intel_idle.max_cstate=1 and other options
Comment 8 Takashi Iwai 2018-10-29 16:17:08 UTC
Created attachment 142259 [details]
Kernel booted with drm.debug=0x0e option
Comment 9 Takashi Iwai 2018-10-29 16:27:50 UTC
I checked repeatedly and confirmed that the pcieport error is always reported right after intel_hdmi_detect.  It happens even when booting with nomodeset and then reloading i915 with modeset=1, so it's not about the boot timing.
Comment 10 Takashi Iwai 2018-10-30 10:33:02 UTC
So I tried to hack around the function, just like

--- a/drivers/gpu/drm/i915/intel_hdmi.c
+++ b/drivers/gpu/drm/i915/intel_hdmi.c
@@ -1921,7 +1921,7 @@ intel_hdmi_detect(struct drm_connector *connector, bool force)
 
 	intel_display_power_get(dev_priv, POWER_DOMAIN_GMBUS);
 
-	if (IS_ICELAKE(dev_priv) &&
+	if (//IS_ICELAKE(dev_priv) &&
 	    !intel_digital_port_connected(encoder))
 		goto out;
 

... and this seems to work.  No NVMe-related errors are seen after this.

I don't mean that this is the right fix, but it indicates that poking the HDMI detection for HDMI-3 screws up the NVMe on PCIe, apparently.
Comment 11 Takashi Iwai 2018-10-30 16:21:11 UTC
The skip of intel_hdmi_detect() works around the problem on SLE15 kernel (that contains lots of i915 backports), too, so this is definitely a key.

One thing I noticed is that the machine detects the HDMI output as DP-1.
And, the actual DP port doesn't seem to work at all.  Even the BIOS screen doesn't appear on this output.  I'm not sure whether it's relevant, but JFYI.
Comment 12 Rodrigo Vivi 2018-11-02 00:19:34 UTC
Created attachment 142333 [details] [review]
I915-DEBUG printks

Hi Takashi,

sorry for the delay here.

Could you please provide the log for the attached patch?

I'd like to understand the flow our code is taking for your
particular case.

Thanks,
Rodrigo
Comment 13 Rodrigo Vivi 2018-11-02 00:29:46 UTC
Also, this series just got merged:

https://patchwork.freedesktop.org/series/51765/

drm/i915/icl: Fix HDMI on TypeC static ports (rev4)

Could you please try to check if that helps somehow?

Thanks,
Rodrigo.
Comment 14 Takashi Iwai 2018-11-02 13:22:11 UTC
(In reply to Rodrigo Vivi from comment #12)
> Could you please provide the log for the attached patch?
> 
> I'd like to understand the flow our code is taking for your
> particular case.

This won't give any output (actually confirmed); the target system is Coffee Lake, not Ice Lake, so the patched code path isn't touched at all.
Comment 15 Takashi Iwai 2018-11-02 13:52:08 UTC
(In reply to Rodrigo Vivi from comment #13)
> Also, this series just got merged:
> 
> https://patchwork.freedesktop.org/series/51765/
> 
> drm/i915/icl: Fix HDMI on TypeC static ports (rev4)
> 
> Could you please try to check if that helps somehow?

Can we have a patchset that applies cleanly on top of the Linux git tree?
Comment 16 Takashi Iwai 2018-11-02 13:55:07 UTC
BTW, another workaround for this issue is to enforce LSPCON:

--- a/drivers/gpu/drm/i915/intel_bios.c
+++ b/drivers/gpu/drm/i915/intel_bios.c
@@ -2120,6 +2120,7 @@ intel_bios_is_lspcon_present(struct drm_i915_private *dev_priv,
 	if (!HAS_LSPCON(dev_priv))
 		return false;
 
+	return true;
 	for (i = 0; i < dev_priv->vbt.child_dev_num; i++) {
 		child = dev_priv->vbt.child_dev + i;
 

Then the NVMe and AER errors are gone, plus, even the dead DP starts working!
Comment 17 Takashi Iwai 2018-11-02 14:49:26 UTC
(In reply to Takashi Iwai from comment #16)
> BTW, another workaround for this issue is to enforce LSPCON:
...
> Then the NVMe and AER errors are gone, plus, even the dead DP becomes
> working!


So this looks more like a BIOS bug to me.
But it would still be helpful to have some workaround that doesn't require patching the code in this ugly way...
Comment 18 Rodrigo Vivi 2018-11-02 17:40:53 UTC
Oops, sorry about the confusion between ICL and CFL... I was looking at other ICL bugs and got confused....

It seems that VBT is the problem here.
Could you please attach /sys/kernel/debug/dri/0/i915_vbt here?

Also, a quick check that I'm particularly curious about:

$ sudo ~/igt/build/tools/intel_vbt_decode /sys/kernel/debug/dri/0/i915_vbt | grep -i lspcon
Comment 19 Takashi Iwai 2018-11-05 14:48:23 UTC
Created attachment 142371 [details]
/sys/kernel/debug/dri/0/i915_vbt content
Comment 20 Takashi Iwai 2018-11-05 14:48:51 UTC
Created attachment 142372 [details]
intel_vbt_decode output
Comment 21 Takashi Iwai 2018-11-05 14:49:39 UTC
% grep -i lspcon vbt-decode
		Onboard LSPCON: no
		Onboard LSPCON: no
		Onboard LSPCON: no
		Onboard LSPCON: no
Comment 22 Ville Syrjala 2018-11-05 15:03:28 UTC
(In reply to Takashi Iwai from comment #10)
> So I tried to hack around the function, just like
> 
> --- a/drivers/gpu/drm/i915/intel_hdmi.c
> +++ b/drivers/gpu/drm/i915/intel_hdmi.c
> @@ -1921,7 +1921,7 @@ intel_hdmi_detect(struct drm_connector *connector,
> bool force)
>  
>  	intel_display_power_get(dev_priv, POWER_DOMAIN_GMBUS);
>  
> -	if (IS_ICELAKE(dev_priv) &&
> +	if (//IS_ICELAKE(dev_priv) &&
>  	    !intel_digital_port_connected(encoder))
>  		goto out;
>  
> 
> ... and this seems working.  No NVMe-related errors are seen after this.
> 
> I don't mean that this is the right fix, but it indicates that poking the
> HDMI detection for HDMI-3 screws up the NVMe on PCIe, apparently.

Hmm. I wonder if the gmbus pins are wired up to some other use, but somehow they are still muxed such that gmbus can control them. That definitely sounds like a BIOS bug, or potentially a pinctrl driver bug.
Comment 23 Rodrigo Vivi 2018-11-05 20:21:56 UTC
Interesting point.

This is a Cannonlake PCH, so the table the driver follows for the pins is:

+----------+-----------+--------------------+
| DDI Type | VBT Value | Bspec Mapped Value |
+----------+-----------+--------------------+
| N/A      | 0x0       | ---                |
| DDI-B    | 0x1       | 0x1                |
| DDI-C    | 0x2       | 0x2                |
| DDI-D    | 0x3       | 0x4                |
| DDI-F    | 0x4       | 0x3                |
+----------+-----------+--------------------+

VBT seems to follow the same numbers there. But maybe someone didn't follow
it properly somewhere else.

To test this possibility we would need to play with the 
cnp_ddc_pin_map table or maybe to use the old direct map like
skl on map_ddc_pin()
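
For illustration only, the table above can be written as a small lookup in C. This is a sketch under stated assumptions, not the driver's actual code: the names `VBT_DDC_DDI_*`, `cnp_pin_map_sketch` and `map_ddc_pin_sketch` are hypothetical, and only the numeric values are taken from the table.

```c
#include <stdint.h>

/* VBT DDC pin values from the table above (enum names are illustrative). */
enum {
	VBT_DDC_DDI_B = 0x1,
	VBT_DDC_DDI_C = 0x2,
	VBT_DDC_DDI_D = 0x3,
	VBT_DDC_DDI_F = 0x4,
};

/* VBT value -> Bspec mapped value; index 0 stays 0, meaning "no mapping". */
static const uint8_t cnp_pin_map_sketch[] = {
	[VBT_DDC_DDI_B] = 0x1,
	[VBT_DDC_DDI_C] = 0x2,
	[VBT_DDC_DDI_D] = 0x4,
	[VBT_DDC_DDI_F] = 0x3,
};

/* Return the mapped pin for a VBT value, or 0 if the table has no entry. */
static uint8_t map_ddc_pin_sketch(uint8_t vbt_pin)
{
	if (vbt_pin < sizeof(cnp_pin_map_sketch) / sizeof(cnp_pin_map_sketch[0]))
		return cnp_pin_map_sketch[vbt_pin];
	return 0;
}
```

Note that only DDI-D and DDI-F differ from an identity mapping (0x3 and 0x4 are swapped). The "direct map" experiment mentioned above would presumably amount to returning the VBT value unchanged.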

----- 

But right now what I'm suspecting is that we have LSPCON on this product but
VBT is simply lying.
Because if returning true for LSPCON presence fixes not only NVMe but also the port identification and gets the port working, I believe we should find a way to get the right information about this board and about the VBT in question.

Takashi, I'm assuming this is not an RVP, right? Could you please contact the OEM in question to get more information about the design, and get us, in PVT, the contact of the Intel FEA involved with this product?

Thanks,
Rodrigo.
Comment 24 Lakshmi 2018-11-27 07:52:33 UTC
Takashi, any updates here?
Comment 25 Takashi Iwai 2018-11-27 12:53:32 UTC
The information was already given to Rodrigo via SUSE bugzilla.

Rather, we've been waiting for an update from the Intel side...
Comment 26 Rodrigo Vivi 2018-12-03 23:03:47 UTC
I'm really sorry for the delay here.

I've filed an internal bug with the VBIOS teams. I will keep this and/or the
SUSE Bugzilla updated.
Comment 27 Lakshmi 2019-02-07 09:10:31 UTC
(In reply to Rodrigo Vivi from comment #26)
> I'm really sorry for the delay here.
> 
> I've filed an internal bug with the VBIOS teams. I will keep this and/or the
> SUSE Bugzilla updated.

Rodrigo, any update here?
Comment 28 Rodrigo Vivi 2019-04-09 15:46:00 UTC
It is a BIOS bug and the customer issue has been closed, so let's reflect that here.

