Bug 105949 - System stall on Supermicro SKL boards on drm-tip
Summary: System stall on Supermicro SKL boards on drm-tip
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other Linux (All)
Importance: high major
Assignee: Imre Deak
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks: 105955
Reported: 2018-04-09 10:11 UTC by Dmitry Ermilov
Modified: 2019-11-29 17:44 UTC (History)
CC List: 5 users

See Also:
i915 platform: SKL
i915 features: power/Other


Attachments
dmesg_drm-tip_4.16.0-rc7+ (71.32 KB, text/plain)
2018-04-09 10:11 UTC, Dmitry Ermilov
dmesg with drm.debug=0xe (81.23 KB, text/plain)
2018-04-09 19:23 UTC, Dmitry Ermilov
dmesg with drm.debug=0xe and i915 warnings (96.23 KB, text/plain)
2018-04-10 11:29 UTC, Dmitry Ermilov
Revert WA 1183 (4.09 KB, patch)
2018-04-11 15:50 UTC, Imre Deak
dmesg_drm-tip_supermicro_v3.log (97.65 KB, text/plain)
2018-04-18 12:30 UTC, Eugeny
Attempt to fix CDCLK PLL lock failure (2.48 KB, patch)
2018-04-19 16:32 UTC, Imre Deak
dmesg v4 (87.53 KB, text/plain)
2018-05-07 16:11 UTC, Eugeny

Description Dmitry Ermilov 2018-04-09 10:11:58 UTC
Created attachment 138696 [details]
dmesg_drm-tip_4.16.0-rc7+

It turns out that Supermicro boards can't be booted with drm-tip.
We see the issue specifically on this hardware:
* Supermicro SYS-5019S-ML/X11SSH-F, BIOS 2.0c 10/06/2017 
* video card device ID 0x191d

OS:
CentOS 7 (7.4), with the kernel updated to drm-tip as of last week, at this revision:
commit 29940f138482ff38047287ad288cea1fcf1f73b4
Author: Oleksandr Andrushchenko <oleksandr_andrushchenko@epam.com>
Date:   Tue Apr 3 16:24:20 2018 +0300
    drm-tip: 2018y-04m-03d-13h-23m-36s UTC integration manifest

An important note:
The systems can be booted with these additional kernel parameters:
"intel_pstate=disable intel_idle.max_cstate=1"

However, we then see warnings from i915:
[    2.978031] [drm:skl_set_cdclk [i915]] *ERROR* DPLL0 not locked
[    2.991381] WARNING: CPU: 7 PID: 312 at drivers/gpu/drm/i915/intel_cdclk.c:826 skl_get_cdclk+0x228/0x260 [i915]
[    3.004969] Modules linked in: i915(+) uas usb_storage sd_mod drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm igb ahci libahci libata dca crc32c_intel i2c_algo_bit i2c_core video
[    3.034581] RIP: 0010:skl_get_cdclk+0x228/0x260 [i915]
[    3.131641]  intel_update_cdclk+0x1c/0x60 [i915]
[    3.139084]  skl_init_cdclk+0xe2/0x1c0 [i915]
[    3.146531]  intel_power_domains_init_hw+0x70c/0x9a0 [i915]
[    3.153976]  i915_driver_load+0x455/0xf20 [i915]
[    3.374369] WARNING: CPU: 7 PID: 312 at drivers/gpu/drm/i915/intel_cdclk.c:826 skl_get_cdclk+0x228/0x260 [i915]
[    3.399330] WARNING: CPU: 7 PID: 312 at drivers/gpu/drm/i915/intel_cdclk.c:826 skl_get_cdclk+0x228/0x260 [i915]
[    3.411999] Modules linked in: i915(+) uas usb_storage sd_mod drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm igb ahci libahci libata dca crc32c_intel i2c_algo_bit i2c_core video
[    3.439025] RIP: 0010:skl_get_cdclk+0x228/0x260 [i915]
[    3.531009]  gen9_dc_off_power_well_enable+0x52/0x210 [i915]
[    3.538334]  intel_power_well_enable+0x36/0x40 [i915]
[    3.545707]  __intel_display_power_get_domain+0x7c/0x90 [i915]
[    3.553179]  intel_display_power_get+0x2e/0x40 [i915]
[    3.560683]  intel_display_set_init_power+0x33/0x40 [i915]
[    3.568215]  intel_power_domains_init_hw+0x6e/0x9a0 [i915]
[    3.575766]  i915_driver_load+0x455/0xf20 [i915]
[    3.799758] WARNING: CPU: 7 PID: 312 at drivers/gpu/drm/i915/intel_cdclk.c:826 skl_get_cdclk+0x228/0x260 [i915]
[    3.818843] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin (v1.27)
[    4.122568] [drm] Initialized i915 1.6.0 20180308 for 0000:00:02.0 on minor 0
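
Because the same WARNING repeats several times in one boot, it can help to tally the messages instead of reading the log by eye. The following is a minimal Python sketch, assuming the log format shown above; the function name `summarize_dmesg` and the regular expressions are illustrative and not part of any i915 tooling.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in the description.
DPLL_ERROR = re.compile(r"\*ERROR\* DPLL0 not locked")
WARNING = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+):(\d+) (\S+)\+")


def summarize_dmesg(text):
    """Count DPLL0 lock errors and kernel WARNINGs per (file, line, function)."""
    summary = {"dpll0_not_locked": 0, "warnings": Counter()}
    for line in text.splitlines():
        if DPLL_ERROR.search(line):
            summary["dpll0_not_locked"] += 1
        match = WARNING.search(line)
        if match:
            path, lineno, func = match.groups()
            summary["warnings"][(path, int(lineno), func)] += 1
    return summary


# Three lines copied from the description, used as sample input.
SAMPLE = """\
[    2.978031] [drm:skl_set_cdclk [i915]] *ERROR* DPLL0 not locked
[    2.991381] WARNING: CPU: 7 PID: 312 at drivers/gpu/drm/i915/intel_cdclk.c:826 skl_get_cdclk+0x228/0x260 [i915]
[    3.374369] WARNING: CPU: 7 PID: 312 at drivers/gpu/drm/i915/intel_cdclk.c:826 skl_get_cdclk+0x228/0x260 [i915]
"""

result = summarize_dmesg(SAMPLE)
print(result["dpll0_not_locked"], dict(result["warnings"]))
```

Running the same function over each attached dmesg would show at a glance which boots hit the lock failure and which did not.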
Comment 1 Imre Deak 2018-04-09 10:30:21 UTC
Could you provide a dmesg log booting with drm.debug=0xe ?
Comment 2 Dmitry Ermilov 2018-04-09 19:23:05 UTC
Created attachment 138707 [details]
dmesg with drm.debug=0xe
Comment 3 Dmitry Ermilov 2018-04-09 20:21:32 UTC
(In reply to Dmitry Ermilov from comment #2)
> Created attachment 138707 [details]
> dmesg with drm.debug=0xe

I just realized that the dmesg output with drm.debug=0xe doesn't contain the warning mentioned in the description. The behavior might be random; we need to double-check.
Comment 4 Dmitry Ermilov 2018-04-10 11:29:25 UTC
Created attachment 138726 [details]
dmesg with drm.debug=0xe and i915 warnings

Indeed, the warning appears randomly. Attaching a new dmesg output.
Comment 5 Imre Deak 2018-04-11 15:50:08 UTC
Created attachment 138761 [details] [review]
Revert WA 1183

It looks like DPLL0, the reference PLL for CDCLK, doesn't lock, but there is no other indication of the root cause.

Could you clarify your report that the board can't be booted? Are there only two scenarios:
1. The board boots with the error messages
2. The board boots without the error messages

Or is there another scenario which you refer to as "can't be booted"?

In the case where it boots without the error messages, is there some other issue you see (system stall reported on bug 105949)?

Did you try to bisect the problem causing the above error messages (noting that you need several boots for each bisect step)?

Specifically, there's been a recent related WA. Could you check whether the problem persists with that WA reverted on top of drm-tip? I attached the revert patch since it won't revert cleanly.

Thanks.
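
For an intermittent failure, the number of boots per bisect step can be estimated rather than guessed. The sketch below assumes each boot of a bad kernel fails independently with a fixed probability; that probability is not known from this report, so the values in the usage lines are purely hypothetical. The function name `boots_needed` is illustrative.

```python
import math


def boots_needed(failure_rate, confidence=0.95):
    """Clean boots in a row required before marking a commit as good.

    If a bad commit fails each boot independently with probability
    failure_rate, then n clean boots in a row occur with probability
    (1 - failure_rate) ** n. The result is the smallest n for which that
    probability drops below 1 - confidence.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


# Hypothetical failure rates, only to show how quickly the cost grows.
for rate in (0.5, 0.1):
    print(rate, boots_needed(rate))
```

The rarer the failure, the more clean boots are needed before a commit can be trusted as good, which is why a single successful boot per step is not enough here.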
Comment 6 Dmitry Ermilov 2018-04-11 17:07:01 UTC
> Or is there another scenario which you refer to as "can't be booted"?
Yes, the third scenario is that the SM boards can't boot at all without "intel_pstate=disable i915.enable_rc6=0 intel_idle.max_cstate=1".
We plan to check exactly which of these options makes booting possible.
Actually, this third scenario is the one we wanted to bring to your attention. The warning is just an observation; I'm not sure whether it's relevant.

> In the case where booting without the error messages, is there some other issue you see (system stall reported on bug 105949)?
Yes, it's bug 105955. We came across bug 105955 first, and after some experiments we discovered the problem described in the current bug.

> Did you try to bisect the problem causing the above error messages (noting that you need several boots for each bisect step)?
Not yet. To get the ball rolling we can try to revert https://patchwork.freedesktop.org/patch/65235/. 

> Specifically there's been a recent related WA, could you try if the problem persists reverting that on top of drm-tip?
Sure. Thank you.

Regards,
Dmitry
Comment 7 Imre Deak 2018-04-11 17:41:41 UTC
(In reply to Dmitry Ermilov from comment #6)
> > Or is there another scenario which you refer to as "can't be booted"?
> Yes, the third scenario is SM boards can't boot at all without
> "intel_pstate=disable i915.enable_rc6=0 intel_idle.max_cstate=1".
> We plan to check which exact option among these helps to boot. 
> Actually this (the 3rd) scenario is the one we wanted to bring to you.
> Presence of the warning is just an observation, I'm not sure if it's
> relevant.

If you're using the system without a display then I'm guessing the presence of this error shouldn't cause a boot failure.

> > In the case where booting without the error messages, is there some other issue you see (system stall reported on bug 105949)?
> Yes, it's bug 105955. Generally at first we came across bug 105955. And
> after some experiments we discovered the problem described in the current
> bug.

Ok, so the lag happens even without the CDCLK PLL enabling issues reported here.

> > Did you try to bisect the problem causing the above error messages (noting that you need several boots for each bisect step)?
> Not yet. To get the ball rolling we can try to revert
> https://patchwork.freedesktop.org/patch/65235/. 

That one re-enabled display C (DC) states. Another way to test if DC states cause the problem is to boot with i915.enable_dc=0 kernel parameter.
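
When testing with extra parameters, it is easy to boot an entry that silently lacks one of them. A minimal Python sketch for checking a kernel command line against the parameters under test follows; the helper names are illustrative, and on a running system the input string would be read from /proc/cmdline.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values containing spaces are
    not handled in this sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline, wanted):
    """Return the wanted name/value pairs that the command line does not set."""
    actual = parse_cmdline(cmdline)
    return {name: value for name, value in wanted.items()
            if actual.get(name) != value}


# Example command line built from the parameters mentioned in this bug.
example = "ro quiet intel_pstate=disable intel_idle.max_cstate=1"
wanted = {"i915.enable_dc": "0", "drm.debug": "0xe"}
print(missing_params(example, wanted))
```

An empty result means every parameter under test was actually passed to the kernel for that boot.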

> > Specifically there's been a recent related WA, could you try if the problem persists reverting that on top of drm-tip?
> Sure. Thank you.
> 
> Regards,
> Dmitry
Comment 8 Eugeny 2018-04-16 14:26:48 UTC
Hi, All

Testing without the parameters "intel_pstate=disable i915.enable_rc6=0 intel_idle.max_cstate=1", and with each parameter on its own, doesn't show any pattern in stability. We get either hangs or a clean dmesg, both without the parameters and with each one individually...

P.S. To clarify: previous tests showed 6 hangs without any of the parameters. The setup hasn't changed since then.
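
Since the hangs are random, one way to look for a pattern is to boot every combination of the three parameters several times and compare hang rates. The following Python sketch enumerates the combinations and tallies outcomes; the function names are illustrative, and the outcomes themselves have to come from real boot attempts.

```python
from itertools import combinations

# The three parameters named in this bug.
PARAMS = [
    "intel_pstate=disable",
    "i915.enable_rc6=0",
    "intel_idle.max_cstate=1",
]


def parameter_subsets(params):
    """Yield every subset of params, from none of them to all of them."""
    for size in range(len(params) + 1):
        for subset in combinations(params, size):
            yield subset


def hang_rates(outcomes):
    """Map each tested subset to its observed hang rate.

    outcomes is a list of (subset, hung) pairs, one per boot attempt.
    """
    totals = {}
    for subset, hung in outcomes:
        boots, hangs = totals.get(subset, (0, 0))
        totals[subset] = (boots + 1, hangs + int(hung))
    return {subset: hangs / boots for subset, (boots, hangs) in totals.items()}


for subset in parameter_subsets(PARAMS):
    print(" ".join(subset) or "(no extra parameters)")
```

With three parameters this gives 8 combinations, so the number of boots per combination has to stay small enough to be practical.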
Comment 9 Eugeny 2018-04-17 15:26:43 UTC
Testing with the WA revert patch didn't help. The system still hangs when we try to boot the patched kernel...
Comment 10 Imre Deak 2018-04-17 15:34:26 UTC
(In reply to Eugeny from comment #9)
> Checking with WA patch brings nothing. Hangs when we tries to boot with
> patched kernel...

Ok, I'd still be interested in finding the reason for the PLL lock issue, which could be a separate problem. Could you check whether the error still appears, with the revert applied, in the cases where the system does boot?

Thanks.
Comment 11 Eugeny 2018-04-18 12:30:59 UTC
Created attachment 138906 [details]
dmesg_drm-tip_supermicro_v3.log

I've got a dmesg after many retries.
Comment 12 Imre Deak 2018-04-19 16:32:09 UTC
Created attachment 138930 [details] [review]
Attempt to fix CDCLK PLL lock failure

(In reply to Eugeny from comment #11)
> Created attachment 138906 [details]
> dmesg_drm-tip_supermicro_v3.log
> 
> I`ve got dmesg after many retries.

Ok, thanks a lot for trying.

After discussing with Ville it's possible that WA#1183 is actually needed for this case, but needs to be applied earlier before disabling the CDCLK PLL. The attached patch does this and also fixes the DMC specific part of the WA; could you give it a try? Again I'd be interested if it fixes the hang, but if not, it'd be good to know if it gets rid of the PLL errors at least (in the successful boots).

In case it doesn't fix the hang: do I understand correctly that you don't have any workaround to get rid of the hang in a consistent way? Note that the i915.enable_rc6 option has been removed. What about different kernel versions? Did you try to disable i915 in your kconfig?

Thanks.
Comment 13 Francesco Balestrieri 2018-05-07 13:41:59 UTC
Ping Dmitry
Comment 14 Dmitry Ermilov 2018-05-07 14:17:52 UTC
Hi Francesco,

Sorry for the delayed reply. We haven't abandoned this. Eugeny will post an update on the patch that Imre suggested.

>>do I understand correctly that you don't have any workaround to get rid of the hang in a consistent way? 
For us it's okay to use "intel_pstate=disable intel_idle.max_cstate=1". These options are the default in our configuration, so this issue isn't a blocker for us. I submitted the bug just to let you know about the issue, because it might affect others.

>>Note that the i915.enable_rc6 option has been removed. 
Yes, we know. It comes from our default set of options, which is left over from older kernels.

>>What about different kernel versions? 
We didn't try. 

>>Did you try to disable i915 in your kconfig?
We didn't try.
Comment 15 Eugeny 2018-05-07 16:11:00 UTC
Created attachment 139406 [details]
dmesg v4

Trying to apply the patch https://bugs.freedesktop.org/attachment.cgi?id=138930 on HEAD 29940f138482ff38047287ad288cea1fcf1f73b4 shows the warning: "patch unexpectedly ends in middle of line"

I still built it, and got a dmesg after 10 tries.
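
GNU patch usually prints "patch unexpectedly ends in middle of line" when the last line of the patch file has no trailing newline, which can happen when a patch is copied out of a browser window instead of being downloaded. Whether that is what happened here is not confirmed in this thread. A minimal Python sketch of a pre-check follows; the function name `check_patch_bytes` and the extra CRLF check are illustrative additions.

```python
def check_patch_bytes(data):
    """Return a list of likely formatting problems in a patch file's bytes."""
    problems = []
    if not data:
        problems.append("file is empty")
    elif not data.endswith(b"\n"):
        problems.append("last line has no trailing newline")
    if b"\r\n" in data:
        problems.append("CRLF line endings present")
    return problems


# Tiny made-up fragments, only to exercise the checks.
print(check_patch_bytes(b"--- a/file\n+++ b/file\n"))
print(check_patch_bytes(b"--- a/file\n+++ b/file"))
```

To check a real file, its contents can be read in binary mode and passed to the function; an empty list means neither problem was found.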
Comment 16 Martin Peres 2018-07-17 14:19:21 UTC
(In reply to Eugeny from comment #15)
> Created attachment 139406 [details]
> dmesg v4
> 
> Trying apply path https://bugs.freedesktop.org/attachment.cgi?id=138930 on
> HEAD 29940f138482ff38047287ad288cea1fcf1f73b4 shows warning: "patch
> unexpectedly ends in middle of line"

The patch is complete, so something is definitely wrong here...

> 
> but still I tried built it and got dmesg after 10 tries.

Anyways, we let this bug fall through the cracks, and I am sorry about that!

Do you think it is possible for you to re-test again?
Comment 17 Imre Deak 2018-07-17 14:41:45 UTC
(In reply to Martin Peres from comment #16)
> (In reply to Eugeny from comment #15)
> > Created attachment 139406 [details]
> > dmesg v4
> > 
> > Trying apply path https://bugs.freedesktop.org/attachment.cgi?id=138930 on
> > HEAD 29940f138482ff38047287ad288cea1fcf1f73b4 shows warning: "patch
> > unexpectedly ends in middle of line"
> 
> The patch is complete, so something is definitely wrong here...
> 
> > 
> > but still I tried built it and got dmesg after 10 tries.
> 
> Anyways, we let this bug fall through the cracks, and I am sorry about that!
> 
> Do you think it is possible for you to re-test again?

There was an offline discussion where the conclusion was that there is no problem with the programming sequence of PLL enabling in the driver, at least based on BSpec. So no need to try the patch. Extending the PLL lock timeout makes the bug more difficult to reproduce, but even with a 1 sec timeout it happens from time to time. So we need to debug further the root cause of the lock failure and possibly try to use an alternative clock source. Since the problem looks related to this specific system (SuperMicro SKL) we need a version of that board with JTAG connector and JTAG debugging enabled. I asked Eugeny to contact the vendor if it's possible to get such a board.
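
To illustrate why a longer timeout only lowers the failure rate instead of removing it, here is a generic poll-with-timeout loop in Python. It is not the i915 code; `read_locked` stands in for a hypothetical status read, and all names and defaults are assumptions made for this sketch.

```python
import time


def wait_for_lock(read_locked, timeout_s, poll_interval_s=0.001,
                  clock=time.monotonic, sleep=time.sleep):
    """Poll read_locked() until it returns True or timeout_s elapses.

    Returns True if the lock was observed and False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if read_locked():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)


# A stand-in status source that reports "locked" on the third poll.
polls = {"count": 0}


def locks_on_third_poll():
    polls["count"] += 1
    return polls["count"] >= 3


print(wait_for_lock(locks_on_third_poll, timeout_s=1.0, poll_interval_s=0.0))
```

A loop like this succeeds whenever the lock is merely slow, so a larger timeout hides late locks. If the PLL never locks at all in some boots, no timeout value helps, which matches the observation that the error still occurs with a 1 sec timeout.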

Accordingly setting this to NEEDINFO.
Comment 18 Lakshmi 2018-08-24 12:24:33 UTC
> I asked Eugeny to contact the vendor if it's possible to get such a
> board.

Is there any workaround that lets us proceed if we don't have the board yet?
Comment 19 Lakshmi 2018-09-10 06:28:39 UTC
Imre, any updates on this issue?
Comment 20 Imre Deak 2018-09-10 09:06:47 UTC
(In reply to Lakshmi from comment #19)
> Imre, any updates on this issue?

Still waiting for the board with JTAG enabled, haven't heard from Dmitry for a while.
Comment 21 Dmitry Ermilov 2018-09-10 09:14:03 UTC
(In reply to Imre Deak from comment #20)
> (In reply to Lakshmi from comment #19)
> > Imre, any updates on this issue?
> 
> Still waiting for the board with JTAG enabled, haven't heard from Dmitry for
> a while.

So far, all efforts to enable JTAG on this SM board have been unsuccessful.
Evgeny, can you please say what exactly doesn't work?
Comment 22 Lakshmi 2019-02-26 09:13:08 UTC
Dmitry, how to proceed with this bug?
Comment 23 Dmitry Ermilov 2019-02-26 09:17:00 UTC
JTAG is not available on the SMC board. We spent a lot of time trying to get it working using engineering samples, but in vain. This activity has been abandoned.
Please feel free to close the ticket since it seems there is no way to investigate the issue properly.
Comment 24 Lakshmi 2019-02-26 09:42:02 UTC
Thanks for the feedback. Closing this bug.
Comment 25 Lakshmi 2019-03-06 10:23:21 UTC
An SMC board is available for investigation.
Comment 26 Lakshmi 2019-06-04 10:16:11 UTC
Imre, any updates here?
Comment 27 ashutosh.dixit 2019-11-25 19:49:40 UTC
Past SLA. Any updates? Should we reduce the priority on this one?

#assessment
Comment 28 Martin Peres 2019-11-29 17:44:19 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/98.

