Created attachment 138696 [details]
dmesg_drm-tip_4.16.0-rc7+

Supermicro boards cannot be booted using drm-tip. We see the issue specifically on this hardware:
* Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.0c 10/06/2017
* video card device ID 0x191d

OS: CentOS 7 (7.4), kernel updated to drm-tip taken last week at this revision:

commit 29940f138482ff38047287ad288cea1fcf1f73b4
Author: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Date: Tue Apr 3 16:24:20 2018 +0300

    drm-tip: 2018y-04m-03d-13h-23m-36s UTC integration manifest

An important note: the systems can be booted with the additional parameters "intel_pstate=disable intel_idle.max_cstate=1"; however, we then see warnings from i915:

[ 2.978031] [drm:skl_set_cdclk [i915]] *ERROR* DPLL0 not locked
[ 2.991381] WARNING: CPU: 7 PID: 312 at drivers/gpu/drm/i915/intel_cdclk.c:826 skl_get_cdclk+0x228/0x260 [i915]
[ 3.004969] Modules linked in: i915(+) uas usb_storage sd_mod drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm igb ahci libahci libata dca crc32c_intel i2c_algo_bit i2c_core video
[ 3.034581] RIP: 0010:skl_get_cdclk+0x228/0x260 [i915]
[ 3.131641] intel_update_cdclk+0x1c/0x60 [i915]
[ 3.139084] skl_init_cdclk+0xe2/0x1c0 [i915]
[ 3.146531] intel_power_domains_init_hw+0x70c/0x9a0 [i915]
[ 3.153976] i915_driver_load+0x455/0xf20 [i915]
[ 3.374369] WARNING: CPU: 7 PID: 312 at drivers/gpu/drm/i915/intel_cdclk.c:826 skl_get_cdclk+0x228/0x260 [i915]
[ 3.399330] WARNING: CPU: 7 PID: 312 at drivers/gpu/drm/i915/intel_cdclk.c:826 skl_get_cdclk+0x228/0x260 [i915]
[ 3.411999] Modules linked in: i915(+) uas usb_storage sd_mod drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm igb ahci libahci libata dca crc32c_intel i2c_algo_bit i2c_core video
[ 3.439025] RIP: 0010:skl_get_cdclk+0x228/0x260 [i915]
[ 3.531009] gen9_dc_off_power_well_enable+0x52/0x210 [i915]
[ 3.538334] intel_power_well_enable+0x36/0x40 [i915]
[ 3.545707] __intel_display_power_get_domain+0x7c/0x90 [i915]
[ 3.553179] intel_display_power_get+0x2e/0x40 [i915]
[ 3.560683] intel_display_set_init_power+0x33/0x40 [i915]
[ 3.568215] intel_power_domains_init_hw+0x6e/0x9a0 [i915]
[ 3.575766] i915_driver_load+0x455/0xf20 [i915]
[ 3.799758] WARNING: CPU: 7 PID: 312 at drivers/gpu/drm/i915/intel_cdclk.c:826 skl_get_cdclk+0x228/0x260 [i915]
[ 3.818843] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin (v1.27)
[ 4.122568] [drm] Initialized i915 1.6.0 20180308 for 0000:00:02.0 on minor 0
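Since the warning may not appear on every boot, it can help to check each saved dmesg mechanically rather than by eye. The following is only a minimal sketch: the two patterns are taken from the log lines in the report above, while the helper name `summarize_dmesg` and the shape of its result are illustrative assumptions, not part of any existing tool.

```python
import re

# Patterns copied from the log lines quoted in the report.
# The function name and the returned dict layout are assumptions.
PLL_ERROR = re.compile(r"\*ERROR\* DPLL0 not locked")
CDCLK_WARN = re.compile(r"WARNING: .* at drivers/gpu/drm/i915/intel_cdclk\.c:\d+")


def summarize_dmesg(text):
    """Count DPLL0 lock errors and intel_cdclk.c warnings in a dmesg dump."""
    lines = text.splitlines()
    return {
        "pll_errors": sum(1 for line in lines if PLL_ERROR.search(line)),
        "cdclk_warnings": sum(1 for line in lines if CDCLK_WARN.search(line)),
    }
```

Feeding it the text of each attachment (for example the output of `dmesg` saved to a file) gives one small record per boot, which makes it easier to see how often the error shows up.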
Could you provide a dmesg log booting with drm.debug=0xe ?
Created attachment 138707 [details] dmesg with drm.debug=0xe
(In reply to Dmitry Ermilov from comment #2) > Created attachment 138707 [details] > dmesg with drm.debug=0xe I just realized that dmesg output with drm.debug=0xe doesn't have the warning mentioned in the description. That might be a random behavior. We need to double-check.
Created attachment 138726 [details] dmesg with drm.debug=0xe and i915 warnings Indeed, the warning is random. Attaching new dmesg output.
Created attachment 138761 [details] [review]
Revert WA 1183

Looks like the reference PLL for CDCLK, DPLL0 doesn't lock, but no other indication as to the root cause for that.

Could you clarify your report that the board can't be booted? Are there only two scenarios:
1. The board boots with the error messages
2. The board boots without the error messages
Or is there another scenario which you refer to as "can't be booted"?

In the case where booting without the error messages, is there some other issue you see (system stall reported on bug 105949)?

Did you try to bisect the problem causing the above error messages (noting that you need several boots for each bisect step)?

Specifically there's been a recent related WA, could you try if the problem persists reverting that on top of drm-tip? I attached the revert patch since it won't revert cleanly.

Thanks.
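On the point that each bisect step needs several boots: a rough way to pick the number is to assume that, on a bad kernel, each boot shows the error independently with some probability p. Then n clean boots in a row happen with probability (1 - p)^n, and n can be chosen so that this stays below an acceptable risk of marking a bad commit as good. The sketch below assumes exactly that model; the failure rate is not known from this report and the function name is hypothetical.

```python
import math


def boots_needed(fail_rate, max_false_good=0.05):
    """Boots per bisect step so that a bad kernel passes all of them
    with probability at most max_false_good.

    Assumes each boot fails independently with probability fail_rate,
    which is a modelling assumption, not a measured property.
    """
    if not 0.0 < fail_rate < 1.0:
        raise ValueError("fail_rate must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    # Solve (1 - fail_rate) ** n <= max_false_good for the smallest integer n.
    return math.ceil(math.log(max_false_good) / math.log(1.0 - fail_rate))
```

For example, if the error appeared on roughly one boot in ten, `boots_needed(0.1)` suggests 29 clean boots before calling a commit good at the 5% level, which shows why bisecting an intermittent issue like this is slow.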
> Or is there another scenario which you refer to as "can't be booted"?

Yes, the third scenario is SM boards can't boot at all without "intel_pstate=disable i915.enable_rc6=0 intel_idle.max_cstate=1". We plan to check which exact option among these helps to boot. Actually this (the 3rd) scenario is the one we wanted to bring to you. Presence of the warning is just an observation, I'm not sure if it's relevant.

> In the case where booting without the error messages, is there some other issue you see (system stall reported on bug 105949)?

Yes, it's bug 105955. Generally at first we came across bug 105955. And after some experiments we discovered the problem described in the current bug.

> Did you try to bisect the problem causing the above error messages (noting that you need several boots for each bisect step)?

Not yet. To get the ball rolling we can try to revert https://patchwork.freedesktop.org/patch/65235/.

> Specifically there's been a recent related WA, could you try if the problem persists reverting that on top of drm-tip?

Sure. Thank you.

Regards,
Dmitry
(In reply to Dmitry Ermilov from comment #6)
> > Or is there another scenario which you refer to as "can't be booted"?
> Yes, the third scenario is SM boards can't boot at all without
> "intel_pstate=disable i915.enable_rc6=0 intel_idle.max_cstate=1".
> We plan to check which exact option among these helps to boot.
> Actually this (the 3rd) scenario is the one we wanted to bring to you.
> Presence of the warning is just an observation, I'm not sure if it's
> relevant.

If you're using the system without a display then I'm guessing the presence of this error shouldn't cause a boot failure.

> > In the case where booting without the error messages, is there some other issue you see (system stall reported on bug 105949)?
> Yes, it's bug 105955. Generally at first we came across bug 105955. And
> after some experiments we discovered the problem described in the current
> bug.

Ok, so the lag happens even without the CDCLK PLL enabling issues reported here.

> > Did you try to bisect the problem causing the above error messages (noting that you need several boots for each bisect step)?
> Not yet. To get the ball rolling we can try to revert
> https://patchwork.freedesktop.org/patch/65235/.

That one re-enabled display C (DC) states. Another way to test if DC states cause the problem is to boot with i915.enable_dc=0 kernel parameter.

> > Specifically there's been a recent related WA, could you try if the problem persists reverting that on top of drm-tip?
> Sure. Thank you.
>
> Regards,
> Dmitry
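When testing with parameters such as i915.enable_dc=0, it is worth confirming that the running kernel actually received them before drawing conclusions from a boot. A minimal sketch follows; it splits on whitespace only, so values quoted with embedded spaces are not handled, and the helper names are illustrative assumptions. `/proc/cmdline` is the standard Linux file holding the command line of the running kernel.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}.

    Bare flags (no '=') map to None. Quoted values containing spaces
    are not handled; this is a simplification.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def running_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently running kernel."""
    with open(path) as handle:
        return handle.read().strip()
```

For example, `parse_cmdline(running_cmdline()).get("i915.enable_dc")` returns "0" on a boot where the parameter was applied and None otherwise.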
Hi all,

Testing without the parameters "intel_pstate=disable i915.enable_rc6=0 intel_idle.max_cstate=1", and with the parameters split up and tried one at a time, doesn't show any pattern in stability. We get many hangs, and also boots with a clean dmesg, both without the parameters and with each of them individually.

P.S. To explain: previous tests showed hangs 6 times without any of the parameters. The setup hasn't changed since then.
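To make the split-parameter testing systematic, every combination of the three options can be listed up front and each one booted several times. The sketch below only generates the combinations; the parameter strings come from this thread, while the function name is a hypothetical helper.

```python
from itertools import combinations

# The three options discussed in this bug.
PARAMS = (
    "intel_pstate=disable",
    "i915.enable_rc6=0",
    "intel_idle.max_cstate=1",
)


def parameter_matrix(params=PARAMS):
    """Return every subset of params, from none to all, as command-line strings."""
    return [
        " ".join(combo)
        for size in range(len(params) + 1)
        for combo in combinations(params, size)
    ]
```

With three options this yields 8 configurations, starting with the empty string (no extra parameters) and ending with all three together. Recording hang or no hang per boot for each entry should make any pattern, or the lack of one, easier to show.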
Testing with the WA patch changes nothing. The system hangs when we try to boot with the patched kernel.
(In reply to Eugeny from comment #9) > Checking with WA patch brings nothing. Hangs when we tries to boot with > patched kernel... Ok, I'd be still interested to find the reason for the PLL lock issue, which could be a separate problem. Could you still see if the cases where it boots up produces the error with the revert? Thanks.
Created attachment 138906 [details]
dmesg_drm-tip_supermicro_v3.log

I've got a dmesg after many retries.
Created attachment 138930 [details] [review]
Attempt to fix CDCLK PLL lock failure

(In reply to Eugeny from comment #11)
> Created attachment 138906 [details]
> dmesg_drm-tip_supermicro_v3.log
>
> I`ve got dmesg after many retries.

Ok, thanks a lot for trying. After discussing with Ville it's possible that WA#1183 is actually needed for this case, but needs to be applied earlier before disabling the CDCLK PLL. The attached patch does this and also fixes the DMC specific part of the WA; could you give it a try? Again I'd be interested if it fixes the hang, but if not, it'd be good to know if it gets rid of the PLL errors at least (in the successful boots).

In case it doesn't fix the hang: do I understand correctly that you don't have any workaround to get rid of the hang in a consistent way? Note that the i915.enable_rc6 option has been removed. What about different kernel versions? Did you try to disable i915 in your kconfig?

Thanks.
Ping Dmitry
Hi Francesco,

Sorry for the delayed reply. This is not abandoned by us. Eugeny will update the status of the patch which Imre suggested.

>>do I understand correctly that you don't have any workaround to get rid of the hang in a consistent way?
For us it's okay to have "intel_pstate=disable intel_idle.max_cstate=1". These options are the default in our configuration, so this issue isn't a blocker for us. I submitted the bug just to let you know about the issue because it might affect others.

>>Note that the i915.enable_rc6 option has been removed.
Yes, we know. It comes from the default set of options we use, left over from old kernels.

>>What about different kernel versions?
We didn't try.

>>Did you try to disable i915 in your kconfig?
We didn't try.
Created attachment 139406 [details]
dmesg v4

Trying to apply the patch https://bugs.freedesktop.org/attachment.cgi?id=138930 on HEAD 29940f138482ff38047287ad288cea1fcf1f73b4 shows the warning "patch unexpectedly ends in middle of line", but I still built it and got a dmesg after 10 tries.
(In reply to Eugeny from comment #15)
> Created attachment 139406 [details]
> dmesg v4
>
> Trying apply path https://bugs.freedesktop.org/attachment.cgi?id=138930 on
> HEAD 29940f138482ff38047287ad288cea1fcf1f73b4 shows warning: "patch
> unexpectedly ends in middle of line"

The patch is complete, so something is definitely wrong here...

>
> but still I tried built it and got dmesg after 10 tries.

Anyways, we let this bug fall through the cracks, and I am sorry about that!

Do you think it is possible for you to re-test again?
(In reply to Martin Peres from comment #16)
> (In reply to Eugeny from comment #15)
> > Created attachment 139406 [details]
> > dmesg v4
> >
> > Trying apply path https://bugs.freedesktop.org/attachment.cgi?id=138930 on
> > HEAD 29940f138482ff38047287ad288cea1fcf1f73b4 shows warning: "patch
> > unexpectedly ends in middle of line"
>
> The patch is complete, so something is definitely wrong here...
>
> >
> > but still I tried built it and got dmesg after 10 tries.
>
> Anyways, we let this bug fall through the cracks, and I am sorry about that!
>
> Do you think it is possible for you to re-test again?

There was an offline discussion where the conclusion was that there is no problem with the programming sequence of PLL enabling in the driver, at least based on BSpec. So no need to try the patch. Extending the PLL lock timeout makes the bug more difficult to reproduce, but even with a 1 sec timeout it happens from time to time.

So we need to debug further the root cause of the lock failure and possibly try to use an alternative clock source. Since the problem looks related to this specific system (SuperMicro SKL) we need a version of that board with JTAG connector and JTAG debugging enabled. I asked Eugeny to contact the vendor if it's possible to get such a board.

Accordingly setting this to NEEDINFO.
> I asked Eugeny to contact the vendor if it's possible to get such a
> board.

Is there any workaround that would let us proceed further if we don't have the board yet?
Imre, any updates on this issue?
(In reply to Lakshmi from comment #19) > Imre, any updates on this issue? Still waiting for the board with JTAG enabled, haven't heard from Dmitry for a while.
(In reply to Imre Deak from comment #20) > (In reply to Lakshmi from comment #19) > > Imre, any updates on this issue? > > Still waiting for the board with JTAG enabled, haven't heard from Dmitry for > a while. So far all efforts to enable JTAG on this SM board were not successful. Evgeny, can you please say what exactly doesn't work?
Dmitry, how should we proceed with this bug?
JTAG is absent in SMC. We spent a lot of time trying to get it working using engineering samples but in vain. This activity has been abandoned. Please feel free to close the ticket since it seems there is no way to investigate the issue properly.
Thanks for the feedback. Closing this bug.
The SMC board is available for investigation.
Imre, any updates here?
Past SLA, any updates, should we reduce the priority on this one? #assessment
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/98.