Bug 93956 - Display Port BYT-M [N2807] - Data link training fails sporadically
Summary: Display Port BYT-M [N2807] - Data link training fails sporadically
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: high critical
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-02-01 14:55 UTC by Zoran Stojsavljevic
Modified: 2017-02-21 09:29 UTC (History)
CC List: 5 users

See Also:
i915 platform: BYT
i915 features: display/DP


Attachments
Patch to fix i2c for passive adapters (4.25 KB, patch)
2016-02-02 15:36 UTC, Jim Bride

Description Zoran Stojsavljevic 2016-02-01 14:55:17 UTC
HW: ATOM BYT-M (N2807)
SW: Vanilla kernels 4.4.0_rc3+

ORIGINAL PROBLEM STATEMENT by Werner (Zeh):

On our current design, where SKU N2807 is used, we can observe an error case where the display port sporadically fails.

We have a DP->LVDS converter chip (PTN3460IBS) soldered onto the board, so a DisplayPort device is hard-wired to the SoC.

In the error case, link training fails on the DisplayPort. Our measurements show no traffic at all on the DisplayPort lanes when the error occurs. The AUX channel, however, appears to work: it still carries traffic when the error occurs.

The error also shows up in the Linux kernel log, which contains the following messages:
[drm] Initialized drm 1.1.0 20060810
[drm] Memory usable by graphics device = 2048M
[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[drm] Driver supports precise vblank timestamp query.
[drm] GMBUS [i915 gmbus dpc] timed out, falling back to bit banging on pin 4
fbcon: inteldrmfb (fb0) is primary device
tsc: Refined TSC clocksource calibration: 1583.333 MHz
[drm] Enabling RC6 states: RC6 off, RC6p off, RC6pp off
Switched to clocksource tsc
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_complete_link_train] *ERROR* failed to train DP, aborting
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1 at drivers/gpu/drm/i915/intel_display.c:1539 vlv_wait_port_ready+0xf3/0x110()
timed out waiting for port C ready: 0xf000a0ff
Modules linked in:
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.14 #1-V%#.%#.#%.#%
00000000 00000000 9a437824 814ad4cc 9a437864 9a437854 8103db6e 81647f58
9a437880 00000001 816467ec 00000603 812e3373 812e3373 99c08000 000000f0
ffff9208 9a43786c 8103dbc3 00000009 9a437864 81647f58 9a437880 9a437898
Call Trace:
[<814ad4cc>] dump_stack+0x48/0x69
[<8103db6e>] warn_slowpath_common+0x7e/0xa0
[<812e3373>] ? vlv_wait_port_ready+0xf3/0x110
[<812e3373>] ? vlv_wait_port_ready+0xf3/0x110
[<8103dbc3>] warn_slowpath_fmt+0x33/0x40
[<812e3373>] vlv_wait_port_ready+0xf3/0x110
[<812fdb92>] vlv_pre_enable_dp+0xd2/0x130
[<812e7c62>] valleyview_crtc_enable+0x102/0x3b0
[<812fa21f>] ? intel_dp_mode_set+0x2f/0x300
[<812ea226>] __intel_set_mode+0x6f6/0x940
[<812ecfe3>] intel_set_mode+0x23/0x40
[<812ed809>] intel_crtc_set_config+0x719/0x8f0
[<812a31eb>] drm_mode_set_config_internal+0x4b/0xc0
[<81295825>] drm_fb_helper_set_par+0x185/0x200
[<81227ef2>] fbcon_init+0x502/0x550
[<812729ce>] visual_init+0x9e/0x100
[<81274cd6>] do_bind_con_driver+0x106/0x2f0
[<811623fc>] ? sysfs_create_file_ns+0x2c/0x30
[<812753cd>] do_take_over_console+0xfd/0x190
[<812256bf>] do_fbcon_takeover+0x5f/0xc0
[<8122aa3f>] fbcon_event_notify+0x6ef/0x7f0
[<8105e031>] notifier_call_chain+0x41/0x60
[<8105e36b>] __blocking_notifier_call_chain+0x3b/0x60
[<8105e3af>] blocking_notifier_call_chain+0x1f/0x30
[<8121cd66>] fb_notifier_call_chain+0x16/0x20
[<8121e76f>] register_framebuffer+0x1af/0x2b0
[<81295434>] drm_fb_helper_initial_config+0x2d4/0x470
[<8131d6f4>] ? gen6_write32+0x64/0x120
[<81104b88>] ? kmem_cache_alloc_trace+0x128/0x130
[<81293b99>] ? drm_fb_helper_init+0xf9/0x160
[<8132412e>] intel_fbdev_initial_config+0x1e/0x20
[<812bb62c>] i915_driver_load+0xc9c/0xcd0
[<8140c4b0>] ? hiddev_disconnect+0x90/0x90
[<8129dfda>] drm_dev_register+0x6a/0x140
[<812a0171>] drm_get_pci_dev+0xc1/0x1e0
[<811657a5>] ? kernfs_create_link+0x55/0x90
[<812b7ec5>] i915_pci_probe+0x35/0x60
[<812109df>] pci_device_probe+0x5f/0xb0
[<81162955>] ? sysfs_create_link+0x25/0x40
[<81342ec3>] really_probe+0x53/0x1f0
[<81210712>] ? pci_match_device+0xb2/0xc0
[<81343127>] __driver_attach+0x77/0x80
[<813430b0>] ? __device_attach+0x50/0x50
[<81341687>] bus_for_each_dev+0x47/0x80
[<81342abe>] driver_attach+0x1e/0x20
[<813430b0>] ? __device_attach+0x50/0x50
[<813427af>] bus_add_driver+0x13f/0x1f0
[<813434f9>] driver_register+0x59/0xe0
[<81210822>] __pci_register_driver+0x32/0x40
[<812a0392>] drm_pci_init+0x102/0x110
[<817542b2>] ? ttm_init+0x64/0x64
[<81754314>] i915_init+0x62/0x64
[<81000472>] do_one_initcall+0xd2/0x120
[<8115a7bb>] ? __proc_create+0x9b/0xd0
[<810583e8>] ? parameq+0x18/0x70
[<817274a1>] ? do_early_param+0x78/0x78
[<81727400>] ? loglevel+0x2/0x2b
[<8105861f>] ? parse_args+0x1df/0x330
[<81078a7f>] ? __wake_up+0x3f/0x50
[<81727b04>] kernel_init_freeable+0xe8/0x18f
[<817274a1>] ? do_early_param+0x78/0x78
[<814a68b0>] kernel_init+0x10/0xe0
[<814b44b7>] ret_from_kernel_thread+0x1b/0x28
[<814a68a0>] ? rest_init+0x80/0x80
---[ end trace c03f51a8b9c35138 ]---
fbcon_init: disable boot-logo (boot-logo bigger than screen).
Console: switching to colour frame buffer device 240x67
i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
i915 0000:00:02.0: registered panic notifier
[drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0

If this error happens, only a reset of the SoC resolves the issue (the rest of the board, and in particular the DP->LVDS converter, stays unchanged).

We tried disabling the DisplayPort in the Linux driver as soon as the error occurs, without success.
_______

PROBLEM DESCRIPTION by Zoran (Stojsavljevic):

Hello Werner,

Having read your latest findings, it is clear that FW and SW bugs in different scenarios prevent N2807 GFX port C from operating correctly.

As far as I understand, you identified the following three use cases with the N2807 (BYT-M), using the actual SKU and the INTEL Bayley Bay CRB Fab. 3. I repeat them here, adding the course of action I will take on the escalation path within INTEL for FSP and for Open Source.
[1] UEFI BIOS (with integrated GOP driver): problem NOT reproducible. This is the reference behavior the other two cases should match;
[2] FSP + Coreboot + vBIOS + SeaBIOS + Linux kernel: problem is visible. I will create an FSP HSD against the PED FSP team;
[3] FSP + Coreboot (vBIOS NOT used), with ONLY the Linux kernel driver initializing GFX: problem is visible. I will create a Bugzilla entry for the GFX OTC INTEL team.

Since you have reproduced this problem on the Bayley Bay Fab. 3 CRB, I am engaging the INTEL OTC team to help resolve it.

Thank you,
Zoran
Comment 1 Jani Nikula 2016-02-01 15:57:07 UTC
(In reply to Zoran Stojsavljevic from comment #0)
> SW: Vanilla kernels 4.4.0_rc3+

...

> CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.14 #1-V%#.%#.#%.#%

Please add drm.debug=14 module parameter and attach dmesg from boot to the problem, running v4.4 or later.
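
For reference, drm.debug is a bitmask, so 14 (0x0e) selects several message categories at once. Below is a minimal sketch of the decoding. It assumes the four low category bits as I recall them from the DRM headers of that era (core, driver, kms, prime). The table and the helper name drm_debug_describe are illustrative only, not kernel API.

```c
#include <stdio.h>

/* Assumed category bits of drm.debug (low four bits only). */
struct drm_debug_bit {
    unsigned int bit;
    const char *name;
};

static const struct drm_debug_bit drm_debug_bits[] = {
    { 0x01, "core" },
    { 0x02, "driver" },
    { 0x04, "kms" },
    { 0x08, "prime" },
};

/* Hypothetical helper: write the space-separated names of the
 * categories set in mask into buf (always NUL-terminated). */
void drm_debug_describe(unsigned int mask, char *buf, size_t len)
{
    size_t i, used = 0;

    if (len == 0)
        return;
    buf[0] = '\0';

    for (i = 0; i < sizeof(drm_debug_bits) / sizeof(drm_debug_bits[0]); i++) {
        int n;

        if (!(mask & drm_debug_bits[i].bit))
            continue;

        n = snprintf(buf + used, len - used, "%s%s",
                     used ? " " : "", drm_debug_bits[i].name);
        if (n < 0 || (size_t)n >= len - used)
            break; /* buffer full: stop with a truncated string */
        used += (size_t)n;
    }
}
```

Under these assumed bits, 14 enables the driver, KMS and PRIME messages but leaves the core messages off. On the kernel command line the parameter is written as drm.debug=14.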
Comment 2 Jim Bride 2016-02-02 15:36:45 UTC
Created attachment 121460
Patch to fix i2c for passive adapters

Please try backporting the attached patch to the customer's kernel and see if it helps. This patch came out of the debugging we did last summer on dongle issues, which turned out to be a general i2c problem rather than something specific to the i915. I know the adapter is on-board, but this fix could still apply. I don't remember for sure whether it landed in time for Linux 4.1 or 4.2.
Comment 3 Werner Zeh 2016-02-03 06:41:25 UTC
Hi.

I have checked our kernel, which is version 4.2.0.16.
The mentioned patch is already in that kernel, and that kernel shows the DisplayPort link training issue.
To me, it looks as if the PHY of the DisplayPort links is not running properly. This is indicated by the DPLLA_CTRL register, whose lower byte holds the PHY status for both ports. Since it does not seem to happen with GOP+UEFI, the kernel driver may be initializing some registers differently.

I also want to highlight that this error happens only on a few boards, not on every one. The affected boards show a temperature dependency: the error is more likely to happen when the SoC is cold.
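
To make the register remark concrete, here is a minimal sketch that splits the lower byte of the value printed in the warning in comment 0 (0xf000a0ff) into per-port status nibbles. The layout (low nibble port B, high nibble port C) is modeled on the DPLL_PORTB_READY_MASK / DPLL_PORTC_READY_MASK names in i915 and should be treated as an assumption. The "ready" rule is also an assumption: my recollection is that the v3.14 code behind that warning waits for the port's bits to read back as zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the lower status byte. */
#define PORTB_READY_MASK 0x0fu
#define PORTC_READY_MASK 0xf0u

struct phy_status {
    uint8_t port_b; /* low nibble */
    uint8_t port_c; /* high nibble of the low byte */
};

/* Hypothetical helper: extract both per-port status nibbles. */
struct phy_status decode_phy_status(uint32_t dpll)
{
    struct phy_status s;

    s.port_b = (uint8_t)(dpll & PORTB_READY_MASK);
    s.port_c = (uint8_t)((dpll & PORTC_READY_MASK) >> 4);
    return s;
}

/* Assumption: port C counts as ready once all of its bits are clear. */
bool port_c_ready_v3_14(uint32_t dpll)
{
    return (dpll & PORTC_READY_MASK) == 0;
}
```

For the logged value 0xf000a0ff, both nibbles decode to 0xf. Under the assumed rule, port C therefore never reported ready, which would be consistent with the timeout in the warning.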
Comment 4 Zoran Stojsavljevic 2016-02-03 07:08:06 UTC
Hello Jim,

I would write this code differently:

         struct drm_i915_private *dev_priv = bus->dev_priv;
-        int i = 0, inc, try = 0;
+        int i = 0, inc = 1, try = 0;
         int ret = 0;

         intel_display_power_get(dev_priv, POWER_DOMAIN_GMBUS);
         mutex_lock(&dev_priv->gmbus_mutex);

         if (bus->force_bit) {
                 ret = i2c_bit_algo.master_xfer(adapter, msgs, num);
                 goto out;
         }

 retry:
         I915_WRITE(GMBUS0, bus->reg0);

         for (; i < num; i += inc) {
-                inc = 1;
                 if (gmbus_is_index_read(msgs, i, num)) {
                         ret = gmbus_xfer_index_read(dev_priv, &msgs[i]);
                         inc = 2; /* an index read is two msgs */
                 } else if (msgs[i].flags & I2C_M_RD) {
                         ret = gmbus_xfer_read(dev_priv, &msgs[i], 0);
                 } else {
                         ret = gmbus_xfer_write(dev_priv, &msgs[i]);
                 }

                 if (ret == -ETIMEDOUT)
                         goto timeout;
                 if (ret == -ENXIO)
                         goto clear_err;

                 ret = gmbus_wait_hw_status(dev_priv, GMBUS_HW_WAIT_PHASE,
                                            GMBUS_HW_WAIT_EN);
                 if (ret == -ENXIO)
                         goto clear_err;
                 if (ret)
                         goto timeout;
         }

My concern is that you can potentially be stuck forever in:
         for (; i < num; i += inc) {

Thank you,
Zoran
Comment 5 Zoran Stojsavljevic 2016-02-03 07:27:04 UTC
(In reply to Zoran Stojsavljevic from comment #4)

Please disregard my message. I overthought it. The code is OK.

I need to think more about this case.

One question to explore: boot via UEFI BIOS + GOP with no displays attached (headless, no monitor at all), and then, once the system is up on Linux kernel 4.2 or later, attach a monitor after the login screen appears and see whether this bug shows up again?

Thank you,
Zoran
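
On the retracted loop concern from comments 4 and 5: the unmodified loop terminates because inc is reset to 1 at the top of every iteration and only ever raised to 2, so i strictly increases. The sketch below models just that stepping logic. struct msg, is_index_read and walk_messages are simplified hypothetical stand-ins, not the i915 functions, and the index-read test is only an approximation of what the driver checks.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical model of one i2c message. */
struct msg {
    bool is_read;
    size_t len;
};

/* Approximation: a short write immediately followed by a read. */
static bool is_index_read(const struct msg *msgs, size_t i, size_t num)
{
    return i + 1 < num &&
           !msgs[i].is_read && msgs[i].len <= 2 &&
           msgs[i + 1].is_read;
}

/* Walk the messages the way the unmodified loop does and return the
 * number of iterations. inc is assigned before every use and is always
 * 1 or 2, so the loop runs at most num times. */
size_t walk_messages(const struct msg *msgs, size_t num)
{
    size_t i = 0, inc, steps = 0;

    for (; i < num; i += inc) {
        inc = 1;
        if (is_index_read(msgs, i, num))
            inc = 2; /* an index read consumes two messages */
        steps++;
    }
    return steps;
}
```

Note that initializing inc once outside the loop, as the proposed change did, would leave it at 2 after the first index read, so later single messages could be skipped. That is a property of this model and not a tested claim about the driver.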
Comment 6 Zoran Stojsavljevic 2016-02-10 17:28:21 UTC
Hello OTC team,

Today, as promised, Werner Zeh from Siemens MC came to IMU Feldkirchen to work with me on the DDI Port C BYT-M (N2807) issue.

We did the following to close in on the potential FW problem and to find the root cause.
[1] I used my BBAY Fab. 3 CRB, swapping the E3826 (two-core BYT-I) for the affected N2807 BYT-M. The reballed "bad" N2807 worked immediately on the BBAY Fab. 3.
[2] With the problematic N2807 in place, I verified all the parameters of the CCG internal BIOS X64.A093.R42 that I built, to check the validity of this BIOS.
[3] The BIOS shows the correct N2807 BYT-M CPUID, 0x30678, as well as the latest MCU in use, 833.
[4] The BIOS I assembled appears to be a UEFI-compliant, 64-bit BIOS, checked visually by version as well as with an .efi 32/64 size checker.
[5] With this BIOS (X64.A093.R42) it is IMPOSSIBLE to reproduce this issue, either with EMGD vBIOS 3909 or with GOP 7.2.1013 (using Fedora 23, kernel 4.3.5-300.fc23.x86_64)!
[6] We then switched to FSP MR4/MR5 (which one is irrelevant, both behave the same way), and the single-channel N2807 exposes this issue very clearly.
[7] After some investigation, we concluded that either FSP does not initialize something the way the UEFI BIOS does, or there is a timing de-synchronization.
[8] I set up this use case according to INTEL rules, to prove the issue on the BBAY Fab. 3 CRB with FSP.
[9] I now have a BBAY Fab. 3 CRB with N2807 and FSP as real proof that we, INTEL, have the problem!

Now I am wondering whether anything can be done at the OTC/kernel boot level, so that the i915 driver initializes some additional registers to solve this issue. (It is clear that the issue comes from the legacy BIOS/FSP level, but that does not necessarily mean it could not be fixed in the Linux kernel by the i915 GFX driver.)

Thank you for understanding,
Zoran
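
As a cross-check of the CPUID value in item [3], here is a minimal sketch that decodes a CPUID leaf 1 EAX signature into family, model and stepping, following the usual Intel rule of folding in the extended model for family 6 and 15. The struct and function names are hypothetical.

```c
#include <stdint.h>

struct cpu_signature {
    unsigned int family;
    unsigned int model;
    unsigned int stepping;
};

/* Hypothetical helper: decode the CPUID.1:EAX signature fields. */
struct cpu_signature decode_cpuid_signature(uint32_t eax)
{
    unsigned int base_family = (eax >> 8) & 0xf;
    unsigned int base_model = (eax >> 4) & 0xf;
    unsigned int ext_model = (eax >> 16) & 0xf;
    unsigned int ext_family = (eax >> 20) & 0xff;
    struct cpu_signature s;

    s.stepping = eax & 0xf;

    /* The extended family is only added for base family 0xf. */
    s.family = (base_family == 0xf) ? base_family + ext_family : base_family;

    /* The extended model is only prepended for families 6 and 0xf. */
    if (base_family == 0x6 || base_family == 0xf)
        s.model = (ext_model << 4) | base_model;
    else
        s.model = base_model;

    return s;
}
```

For 0x30678 this gives family 6, model 0x37, stepping 8. Model 0x37 is, as far as I know, the Silvermont-based Bay Trail model number, which matches the N2807.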
Comment 7 Ville Syrjala 2016-02-10 18:41:38 UTC
(In reply to Zoran Stojsavljevic from comment #6)

First, what is FSP?

And second, we're still waiting for that debug log Jani requested.
Comment 8 Zoran Stojsavljevic 2016-02-12 10:18:00 UTC
> (In reply to Ville Syrjala from comment #7)
> 
> First, what is FSP?
> 
> And second, we're still waiting for that debug log Jani requested.

Hello Ville,

First, FSP stands for Firmware Support Package. Basically, it is the PEI section of the BIOS, extracted from the INTEL BIOS itself and compiled and linked as a binary blob to be integrated into the Coreboot boot loader (coreboot.org).

You can find more about FSP here:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-firmware-support-package/intel-fsp-overview.html

It is an open site (publicly available).

Second, I am really sorry: Werner and I did many tests together, but I forgot to do the test Jani asked for, and the HDD with Werner's Fedora 23 and 4.4.0-rc3+ got completely corrupted. This morning I made a clone of my Fedora 23 HDD for him (with a few more updates thrown in) and sent it off. The HDD is on its way to Erlangen and will reach him on Monday (February 15th, 2016).

Once Werner receives the HDD (I tested the clone, it works), he will perform this test and post the results here.

Please note that I am on a well-deserved vacation from February 16th, 2016 until the end of February 2016. In the meantime, please coordinate all other actions with Werner.

Thank you,
Zoran
Comment 9 Ricardo 2017-02-21 01:24:36 UTC
There has been no update on this for several months. Logs were requested if the issue persisted. I will leave the bug open for 30 days; if no response is received, it will be closed.
Comment 10 Werner Zeh 2017-02-21 09:19:41 UTC
That issue has been resolved on BIOS level. You can have a look at http://review.coreboot.org/#/c/13743 for details.

Sorry that I have completely forgotten to update the status here.
We can close it now.
Comment 11 yann 2017-02-21 09:29:45 UTC
(In reply to Werner Zeh from comment #10)
> That issue has been resolved on BIOS level. You can have a look at
> http://review.coreboot.org/#/c/13743 for details.
> 
> Sorry that I have completely forgotten to update the status here.
> We can close it now.

Thanks, Werner Zeh, for your confirmation. Closing the bug ticket now.

