HW: ATOM BYT-M (N2807)
SW: Vanilla kernels 4.4.0_rc3+

ORIGINAL PROBLEM STATEMENT by Werner (Zeh):

On our current design, which uses the N2807 SKU, we observe an error case where the display port sporadically fails. We have a DP->LVDS converter chip (PTN3460IBS) soldered on the board, so a display port device is hard-wired to the SoC. In the error case, link training fails on the display port. Our measurements show absolutely no traffic on the display port lanes when the error happens. The AUX port, however, seems to work fine, as there is traffic on it when the error happens. The error is also visible in the Linux kernel log, which contains the following messages:

[drm] Initialized drm 1.1.0 20060810
[drm] Memory usable by graphics device = 2048M
[drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[drm] Driver supports precise vblank timestamp query.
[drm] GMBUS [i915 gmbus dpc] timed out, falling back to bit banging on pin 4
fbcon: inteldrmfb (fb0) is primary device
tsc: Refined TSC clocksource calibration: 1583.333 MHz
[drm] Enabling RC6 states: RC6 off, RC6p off, RC6pp off
Switched to clocksource tsc
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_start_link_train] *ERROR* too many full retries, give up
[drm:intel_dp_complete_link_train] *ERROR* failed to train DP, aborting
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1 at drivers/gpu/drm/i915/intel_display.c:1539 vlv_wait_port_ready+0xf3/0x110()
timed out waiting for port C ready: 0xf000a0ff
Modules linked in:
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.14 #1-V%#.%#.#%.#%
 00000000 00000000 9a437824 814ad4cc 9a437864 9a437854 8103db6e 81647f58
 9a437880 00000001 816467ec 00000603 812e3373 812e3373 99c08000 000000f0
 ffff9208 9a43786c 8103dbc3 00000009 9a437864 81647f58 9a437880 9a437898
Call Trace:
 [<814ad4cc>] dump_stack+0x48/0x69
 [<8103db6e>] warn_slowpath_common+0x7e/0xa0
 [<812e3373>] ? vlv_wait_port_ready+0xf3/0x110
 [<812e3373>] ? vlv_wait_port_ready+0xf3/0x110
 [<8103dbc3>] warn_slowpath_fmt+0x33/0x40
 [<812e3373>] vlv_wait_port_ready+0xf3/0x110
 [<812fdb92>] vlv_pre_enable_dp+0xd2/0x130
 [<812e7c62>] valleyview_crtc_enable+0x102/0x3b0
 [<812fa21f>] ? intel_dp_mode_set+0x2f/0x300
 [<812ea226>] __intel_set_mode+0x6f6/0x940
 [<812ecfe3>] intel_set_mode+0x23/0x40
 [<812ed809>] intel_crtc_set_config+0x719/0x8f0
 [<812a31eb>] drm_mode_set_config_internal+0x4b/0xc0
 [<81295825>] drm_fb_helper_set_par+0x185/0x200
 [<81227ef2>] fbcon_init+0x502/0x550
 [<812729ce>] visual_init+0x9e/0x100
 [<81274cd6>] do_bind_con_driver+0x106/0x2f0
 [<811623fc>] ? sysfs_create_file_ns+0x2c/0x30
 [<812753cd>] do_take_over_console+0xfd/0x190
 [<812256bf>] do_fbcon_takeover+0x5f/0xc0
 [<8122aa3f>] fbcon_event_notify+0x6ef/0x7f0
 [<8105e031>] notifier_call_chain+0x41/0x60
 [<8105e36b>] __blocking_notifier_call_chain+0x3b/0x60
 [<8105e3af>] blocking_notifier_call_chain+0x1f/0x30
 [<8121cd66>] fb_notifier_call_chain+0x16/0x20
 [<8121e76f>] register_framebuffer+0x1af/0x2b0
 [<81295434>] drm_fb_helper_initial_config+0x2d4/0x470
 [<8131d6f4>] ? gen6_write32+0x64/0x120
 [<81104b88>] ? kmem_cache_alloc_trace+0x128/0x130
 [<81293b99>] ? drm_fb_helper_init+0xf9/0x160
 [<8132412e>] intel_fbdev_initial_config+0x1e/0x20
 [<812bb62c>] i915_driver_load+0xc9c/0xcd0
 [<8140c4b0>] ? hiddev_disconnect+0x90/0x90
 [<8129dfda>] drm_dev_register+0x6a/0x140
 [<812a0171>] drm_get_pci_dev+0xc1/0x1e0
 [<811657a5>] ? kernfs_create_link+0x55/0x90
 [<812b7ec5>] i915_pci_probe+0x35/0x60
 [<812109df>] pci_device_probe+0x5f/0xb0
 [<81162955>] ? sysfs_create_link+0x25/0x40
 [<81342ec3>] really_probe+0x53/0x1f0
 [<81210712>] ? pci_match_device+0xb2/0xc0
 [<81343127>] __driver_attach+0x77/0x80
 [<813430b0>] ? __device_attach+0x50/0x50
 [<81341687>] bus_for_each_dev+0x47/0x80
 [<81342abe>] driver_attach+0x1e/0x20
 [<813430b0>] ? __device_attach+0x50/0x50
 [<813427af>] bus_add_driver+0x13f/0x1f0
 [<813434f9>] driver_register+0x59/0xe0
 [<81210822>] __pci_register_driver+0x32/0x40
 [<812a0392>] drm_pci_init+0x102/0x110
 [<817542b2>] ? ttm_init+0x64/0x64
 [<81754314>] i915_init+0x62/0x64
 [<81000472>] do_one_initcall+0xd2/0x120
 [<8115a7bb>] ? __proc_create+0x9b/0xd0
 [<810583e8>] ? parameq+0x18/0x70
 [<817274a1>] ? do_early_param+0x78/0x78
 [<81727400>] ? loglevel+0x2/0x2b
 [<8105861f>] ? parse_args+0x1df/0x330
 [<81078a7f>] ? __wake_up+0x3f/0x50
 [<81727b04>] kernel_init_freeable+0xe8/0x18f
 [<817274a1>] ? do_early_param+0x78/0x78
 [<814a68b0>] kernel_init+0x10/0xe0
 [<814b44b7>] ret_from_kernel_thread+0x1b/0x28
 [<814a68a0>] ? rest_init+0x80/0x80
---[ end trace c03f51a8b9c35138 ]---
fbcon_init: disable boot-logo (boot-logo bigger than screen).
Console: switching to colour frame buffer device 240x67
i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
i915 0000:00:02.0: registered panic notifier
[drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0

If this error happens, only a reset of the SoC resolves the issue (the rest of the board, and especially the DP->LVDS converter, stays unchanged). We have tried disabling the display port in the Linux driver as soon as the error happens, with no result.

_______

PROBLEM DESCRIPTION by Zoran (Stojsavljevic):

Hello Werner,

Upon reading your latest findings, it is obvious that there are FW and SW bugs in different scenarios preventing N2807 GFX port C from operating correctly.
As far as I understand, you identified the following three use cases with the N2807 (BYT-M), using the actual SKU and the INTEL Bayley Bay CRB Fab. 3 (I repeat them here, adding the course of action I'll take on the escalation path within INTEL for FSP and Open Source):

[1] UEFI BIOS used (along with the integrated GOP driver) - problem NOT reproducible; this is the reference behavior to be achieved in the other two cases;
[2] FSP + Coreboot + vBIOS + SeaBIOS + Linux kernel - problem is visible - I will create an FSP HSD against the PED FSP team;
[3] FSP + Coreboot (vBIOS NOT used), ONLY the Linux kernel driver is used to init GFX - problem is visible - I will create a Bugzilla entry for the GFX OTC INTEL team.

Since you have reproduced this problem on the Bayley Bay Fab. 3 CRB, I am engaging the INTEL OTC team in further resolution of this problem.

Thank you,
Zoran
(In reply to Zoran Stojsavljevic from comment #0)
> SW: Vanilla kernels 4.4.0_rc3+
...
> CPU: 1 PID: 1 Comm: swapper/0 Not tainted 3.14 #1-V%#.%#.#%.#%

Please add the drm.debug=14 module parameter and attach the dmesg from boot up to the point where the problem occurs, running v4.4 or later.
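For readers unfamiliar with the parameter: drm.debug takes a bitmask of debug categories, and 14 is 0x0e. The following is a minimal C sketch of how that value decomposes. The bit values are assumed to match the DRM_UT_* definitions in kernels of that era (core 0x01, driver 0x02, KMS 0x04, PRIME 0x08), and the names used here are hypothetical, so check them against the kernel source in use.

```c
#include <stdbool.h>

/* Debug category bits, ASSUMED to mirror the DRM_UT_* values of
 * 4.x-era kernels. The enum and its names are illustrative only. */
enum dbg_category {
    DBG_CORE   = 0x01,
    DBG_DRIVER = 0x02,
    DBG_KMS    = 0x04,
    DBG_PRIME  = 0x08
};

/* Hypothetical helper: true if the category bit is set in the mask
 * that would be passed as drm.debug=<mask>. */
bool debug_mask_has(unsigned int mask, enum dbg_category cat)
{
    return (mask & (unsigned int)cat) != 0;
}
```

Under these assumptions, a mask of 14 enables the driver, KMS and PRIME categories and leaves the core category off.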
Created attachment 121460
Patch to fix i2c for passive adapters

Please try backporting the attached patch to the customer's kernel and see if it helps. This patch was the result of the debugging we did last summer with dongle issues, which ended up being a general i2c problem rather than something specific to the i915. I know the adaptor is on-board, but this fix could still apply. I don't remember for sure whether this one landed in time for Linux 4.1 or whether it was 4.2.
Hi. I have checked our kernel, which is version 4.2.0.16. The mentioned patch is already in that kernel, and it still shows the display port link training issue. To me, it seems like the PHY of the display port links is not running properly. This is indicated by the DPLLA_CTRL register, where the lower byte holds the PHY status for both ports. As it seems not to happen with GOP+UEFI, maybe some registers are initialized differently by the kernel driver. I also want to highlight that this error happens only on a few boards, not on all of them. The affected boards show a temperature dependency: the error is more likely to happen when the SoC is cold.
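To make the register reading described above concrete, here is a minimal C sketch that splits the low byte of the DPLL value into one status field per port. The field layout (port B in bits 3:0, port C in bits 7:4) is an assumption based on the i915 register definitions of that era, and the macro and helper names are hypothetical, so verify both against the actual kernel source.

```c
#include <stdint.h>

/* ASSUMED layout of the low byte (verify against i915_reg.h):
 * bits 3:0 = port B status, bits 7:4 = port C status. */
#define PORTB_STATUS_MASK 0x0000000fu
#define PORTC_STATUS_MASK 0x000000f0u

/* Hypothetical helper: 4-bit status field for port B. */
uint32_t dpll_port_b_status(uint32_t dpll)
{
    return dpll & PORTB_STATUS_MASK;
}

/* Hypothetical helper: 4-bit status field for port C. */
uint32_t dpll_port_c_status(uint32_t dpll)
{
    return (dpll & PORTC_STATUS_MASK) >> 4;
}
```

For the value printed by the warning in the original report, 0xf000a0ff, both helpers return 0xf, i.e. all four status bits are set for port B and for port C alike.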
Hello Jim,

I would write this code differently:

    482         struct drm_i915_private *dev_priv = bus->dev_priv;
    483 --      int i = 0, inc, try = 0;
    483 ++      int i = 0, inc = 1, try = 0;
    484         int ret = 0;
    485
    486         intel_display_power_get(dev_priv, POWER_DOMAIN_GMBUS);
    487         mutex_lock(&dev_priv->gmbus_mutex);
    488
    489         if (bus->force_bit) {
    490                 ret = i2c_bit_algo.master_xfer(adapter, msgs, num);
    491                 goto out;
    492         }
    493
    494 retry:
    495         I915_WRITE(GMBUS0, bus->reg0);
    496
    497         for (; i < num; i += inc) {
    498 --              inc = 1;
    499                 if (gmbus_is_index_read(msgs, i, num)) {
    500                         ret = gmbus_xfer_index_read(dev_priv, &msgs[i]);
    501                         inc = 2; /* an index read is two msgs */
    502                 } else if (msgs[i].flags & I2C_M_RD) {
    503                         ret = gmbus_xfer_read(dev_priv, &msgs[i], 0);
    504                 } else {
    505                         ret = gmbus_xfer_write(dev_priv, &msgs[i]);
    506                 }
    507
    508                 if (ret == -ETIMEDOUT)
    509                         goto timeout;
    510                 if (ret == -ENXIO)
    511                         goto clear_err;
    512
    513                 ret = gmbus_wait_hw_status(dev_priv, GMBUS_HW_WAIT_PHASE,
    514                                            GMBUS_HW_WAIT_EN);
    515                 if (ret == -ENXIO)
    516                         goto clear_err;
    517                 if (ret)
    518                         goto timeout;
    519         }

Since you can potentially be stuck in:

    497         for (; i < num; i += inc) {

forever.

Thank you,
Zoran
(In reply to Zoran Stojsavljevic from comment #4)
> Hello Jim,
>
> I would write this code differently:
> [...]

Please disregard my previous message. I overthought it. The code is OK.

I need to think more about this case. One question worth exploring: if the system boots headless (no monitor attached at all) via UEFI BIOS + GOP, and a monitor is attached only after a 4.2 or later Linux kernel has reached the login screen, does this bug show up again?

Thank you,
Zoran
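As an illustration of why the loop from comment #4 is fine as written, the sketch below models only its stepping logic in plain C. Everything in it is a simplified stand-in (the struct, the MSG_READ flag and the index-read test are hypothetical and not the driver's real definitions). It compares resetting inc on every iteration, as the driver does, with initializing it once, as the retracted proposal did.

```c
#include <stdbool.h>
#include <stddef.h>

#define MSG_READ 0x1u /* hypothetical stand-in for I2C_M_RD */

struct msg {
    unsigned int flags;
    unsigned int len;
};

/* Simplified model of an index read: a short write immediately
 * followed by a read. This is an assumption for illustration. */
static bool is_index_read(const struct msg *m, size_t i, size_t num)
{
    return i + 1 < num &&
           !(m[i].flags & MSG_READ) && m[i].len <= 2 &&
           (m[i + 1].flags & MSG_READ);
}

/* Stepping as in the driver: inc is reset to 1 on every iteration. */
size_t handled_with_reset(const struct msg *m, size_t num)
{
    size_t i = 0, inc, handled = 0;

    for (; i < num; i += inc) {
        inc = 1;
        if (is_index_read(m, i, num))
            inc = 2; /* an index read is two msgs */
        handled += inc;
    }
    return handled;
}

/* Retracted variant: inc is initialized once and never reset, so it
 * stays at 2 after the first index read. */
size_t handled_with_init_once(const struct msg *m, size_t num)
{
    size_t i = 0, inc = 1, handled = 0;

    for (; i < num; i += inc) {
        bool index_read = is_index_read(m, i, num);

        if (index_read)
            inc = 2;
        handled += index_read ? 2 : 1;
    }
    return handled;
}
```

In this model inc is always at least 1 when `i += inc` runs, so both loops terminate. For a sequence of a 1-byte write, a read, and two 4-byte writes, the first function handles all 4 messages while the second handles only 3, because the stale step of 2 skips the last write.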
Hello OTC team,

Today, as I promised, Werner Zeh from Siemens MC came to IMU Feldkirchen to work together with me on the DDI Port C BYT-M (N2807) issue/bug.

We did the following in order to close in on the potential FW problem and to find the root cause:

[1] I used my BBAY Fab. 3 CRB, swapping the E3826 (two-core BYT-I) for the issue-infested N2807 BYT-M. The reballed "bad" N2807 worked immediately with the BBAY Fab. 3.
[2] Once I had the problematic N2807, I verified all the parameters of the CCG internal BIOS X64.A093.R42 that I had built, to check the validity of this BIOS.
[3] The BIOS shows the correct N2807 BYT-M CPUID 0x30678, as well as the latest MCU in use, 833.
[4] The BIOS (assembled by me) appears to be a UEFI-compliant 64-bit BIOS, checked visually by version as well as with the .efi 32/64 size checker.
[5] With this BIOS (X64.A093.R42) it is IMPOSSIBLE to reproduce this issue with EMGD vBIOS 3909, or with GOP 7.2.1013 (using Fedora 23, kernel 4.3.5-300.fc23.x86_64)!
[6] Then we switched gears to FSP MR4/MR5 (which one is irrelevant, both behave the same way), and the single-channel N2807 shows this issue very clearly.
[7] After some investigation, we concluded that either something in FSP is not initialized the way the UEFI BIOS does it, or there is a timing de-synchronization.
[8] I set up this use case according to INTEL rules, to prove the issue on the BBAY Fab. 3 CRB with FSP used.
[9] I have a BBAY Fab. 3 CRB with N2807 and FSP as real proof that we, INTEL, have the problem!

Now... I am just wondering whether anything can be done at the OTC/booting kernel level, so that some additional registers are initialized by the i915 driver to solve this issue (it is clear that the issue comes from the legacy BIOS/FSP level, which does not necessarily mean that it could not be fixed by the Linux kernel using the i915 GFX driver).

Thank you for understanding,
Zoran
(In reply to Zoran Stojsavljevic from comment #6)
> Hello OTC team,
> [...]

First, what is FSP?

And second, we're still waiting for that debug log Jani requested.
(In reply to Ville Syrjala from comment #7)
> First, what is FSP?
>
> And second, we're still waiting for that debug log Jani requested.

Hello Ville,

First, FSP stands for Firmware Support Package. Basically, it is the PEI section of the BIOS, extracted from the INTEL BIOS itself and compiled and linked as a binary blob to be integrated into the Coreboot boot loader (coreboot.org). You can find more about FSP here:
http://www.intel.com/content/www/us/en/intelligent-systems/intel-firmware-support-package/intel-fsp-overview.html
It is an open (publicly available) site.

Second, I am really sorry. Werner and I did many tests together, but I forgot to do this test (requested by Jani), and Werner's Fedora 23 HDD with 4.4.0-rc3+ got completely corrupted. This morning I prepared a clone of my Fedora 23 HDD for him (with a few more updates thrown in) and sent it to him. The HDD is on its way to Erlangen and will reach him on Monday (February 15th, 2016). Once Werner receives the HDD (I tested the clone, it works), he will perform this test and post the results here.

Please note that I am on a well-deserved vacation from February 16th, 2016, until the end of February 2016. In the meantime, please coordinate all other actions with Werner.

Thank you,
Zoran
There has been no update on this for several months. Logs were requested in case the issue persisted. We will leave the bug open for 30 more days. If no response is received, it will be closed.
That issue has been resolved on BIOS level. You can have a look at http://review.coreboot.org/#/c/13743 for details. Sorry that I have completely forgotten to update the status here. We can close it now.
(In reply to Werner Zeh from comment #10)
> That issue has been resolved on BIOS level. You can have a look at
> http://review.coreboot.org/#/c/13743 for details.
>
> Sorry that I have completely forgotten to update the status here.
> We can close it now.

Thanks, Werner, for the confirmation. Closing the bug now.