Bug 109078 - Warning PID: 1150 at drivers/gpu/drm/i915/intel_cdclk.c:835
Summary: Warning PID: 1150 at drivers/gpu/drm/i915/intel_cdclk.c:835
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Reported: 2018-12-17 16:14 UTC by Valerio Vanni
Modified: 2019-05-02 12:11 UTC

See Also:
i915 platform: CFL
i915 features:


Attachments
dmesg with drm.debug=0x1e (working state) (594.54 KB, text/plain)
2018-12-18 15:14 UTC, Valerio Vanni
dmesg with drm.debug=0x1e (warning state) (394.82 KB, text/plain)
2018-12-21 17:30 UTC, Valerio Vanni
dmesg with drm.debug=0x1e (warning state - kernel 4.20) (3.52 MB, text/plain)
2019-01-02 02:43 UTC, Valerio Vanni
warning with 4.20.3 - (2.10 MB, text/plain)
2019-01-21 09:44 UTC, Valerio Vanni

Description Valerio Vanni 2018-12-17 16:14:08 UTC
I have Debian stretch, with sysv init and vanilla kernel 4.19.9 (the same happened with 4.18 kernels; I updated to 4.19 to see if that fixed it, but it didn't).

The machine has a Coffee Lake CPU (i8700) on an Asus Prime B306M-A motherboard.

The video driver in xorg.conf.d is modesetting.
In modprobe.d I have "options i915 enable_guc=3" and "options i915 modeset=1".


Usually everything works fine.

Sometimes I find the system powered on, with a blank screen and no response to keyboard, mouse, ping, telnet, etc. (I have to power it off with the button).
Perhaps that's not related; I can't see any output (I still have to fix my serial console cable).

Some other times the system boots but I find this in dmesg:


[   26.731496] ------------[ cut here ]------------
[   26.827551] WARN_ON((val & ((1 << ((0) * 6 + 5)) | (1 << ((0) * 6 + 4)) | (1 << ((0) * 6)))) != (1 << ((0) * 6)))
[   26.941236] WARNING: CPU: 1 PID: 1150 at drivers/gpu/drm/i915/intel_cdclk.c:835 skl_get_cdclk+0x211/0x250 [i915]
[   26.941237] Modules linked in:
[   27.173871]  videobuf2_memops videobuf2_v4l2 kvm_intel snd_hwdep snd_pcm_oss i915(+) kvm snd_mixer_oss
[   27.545801]  r8169 videodev xhci_pci snd_pcm iosf_mbi snd_timer libphy xhci_hcd intel_gtt irqbypass videobuf2_common
[   27.728737]  pcspkr rtc_cmos video backlight evdev
[   27.728739] CPU: 1 PID: 1150 Comm: systemd-udevd Tainted: G     U            4.19.9 #1
[   27.728740] Hardware name: System manufacturer System Product Name/PRIME B360M-A, BIOS 1602 10/10/2018
[   27.963411] RIP: 0010:skl_get_cdclk+0x211/0x250 [i915]
[   27.963412] Code: a0 48 c7 c7 bc 2e ab a0 e8 7c b3 62 e0 0f 0b 8b 53 04 e9 35 fe ff ff 48 c7 c6 e8 5d ac a0 48 c7 c7 bc 2e ab a0 e8 5f b3 62 e0 <0f> 0b 8b 53 04 e9 18 fe ff ff 89 c2 48 c7 c6 0c 2f ab a0 48 c7 c7
[   28.527720] RSP: 0018:ffffc90000687ac8 EFLAGS: 00010286
[   28.574275] RAX: 0000000000000000 RBX: ffff88845654554c RCX: 0000000000000000
[   28.643540] RDX: 0000000000000001 RSI: 0000000000000092 RDI: ffffffff82051bac
[   28.712811] RBP: ffff888456540000 R08: ffff88845df66280 R09: 0000000000000001
[   28.782077] R10: 0000000070000007 R11: 0000000000000000 R12: 0000000000000000
[   28.851344] R13: ffff8884565468f0 R14: ffff88845cb1c000 R15: ffff8884565406b0
[   28.920620] FS:  00007f22acc9f8c0(0000) GS:ffff88846da40000(0000) knlGS:0000000000000000
[   29.001250] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   29.054006] CR2: 00007f2510fca690 CR3: 000000045c5dc005 CR4: 00000000003606e0
[   29.123277] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   29.192552] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   29.261821] Call Trace:
[   29.275370]  intel_update_cdclk+0x17/0x60 [i915]
[   29.314803]  skl_init_cdclk+0x3d/0x1b0 [i915]
[   29.351037]  ? intel_power_well_enable+0x2f/0x40 [i915]
[   29.397601]  intel_power_domains_init_hw+0x771/0x990 [i915]
[   29.448292]  i915_driver_load+0x94a/0xf50 [i915]
[   29.487614]  pci_device_probe+0xa1/0x130
[   29.518688]  really_probe+0x230/0x2d0
[   29.546663]  driver_probe_device+0x4b/0xe0
[   29.579796]  __driver_attach+0xb4/0xc0
[   29.608799]  ? driver_probe_device+0xe0/0xe0
[   29.643998]  bus_for_each_dev+0x62/0xb0
[   29.674038]  bus_add_driver+0x10c/0x210
[   29.704079]  driver_register+0x56/0xe0
[   29.733086]  ? 0xffffffffa0b46000
[   29.756931]  do_one_initcall+0x43/0x1c0
[   29.786976]  ? __vunmap+0x75/0xb0
[   29.810824]  ? _cond_resched+0x11/0x40
[   29.839835]  do_init_module+0x56/0x1f0
[   29.868846]  load_module+0x1f72/0x25b0
[   29.897859]  ? vfs_read+0x114/0x130
[   29.923767]  ? __se_sys_finit_module+0xb3/0xc0
[   29.961037]  __se_sys_finit_module+0xb3/0xc0
[   29.996248]  do_syscall_64+0x43/0xf0
[   30.023191]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.067682] RIP: 0033:0x7f22abb5b229
[   30.094633] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 3f 4c 2b 00 f7 d8 64 89 01 48
[   30.302254] RSP: 002b:00007ffe9a7c9fd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[   30.376693] RAX: ffffffffffffffda RBX: 0000557366cb0ab0 RCX: 00007f22abb5b229
[   30.445958] RDX: 0000000000000000 RSI: 0000557366caeaf0 RDI: 000000000000000e
[   30.515229] RBP: 0000557366caeaf0 R08: 0000000000000000 R09: 0000000000000016
[   30.584491] R10: 000000000000000e R11: 0000000000000246 R12: 0000000000000000
[   30.653753] R13: 0000557366cf1c70 R14: 0000000000020000 R15: 0000000000000000
[   30.723014] ---[ end trace e23c72a78f7a8889 ]---
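
The expanded WARN_ON condition in the trace above is easier to read once the shifts are evaluated. A minimal Python sketch, assuming the `(0)` in the expression is a DPLL index of 0; the bit names are hypothetical labels, since the log shows only the numbers:

```python
# Evaluate the expanded condition from the WARN_ON line in the log:
#   (val & ((1 << (0*6 + 5)) | (1 << (0*6 + 4)) | (1 << (0*6)))) != (1 << (0*6))
# The names below are illustrative labels only; the log does not name the bits.
DPLL_ID = 0  # assumption: the "(0)" in the expansion is the DPLL index

MODE_BIT = 1 << (DPLL_ID * 6 + 5)      # 0x20
SSC_BIT = 1 << (DPLL_ID * 6 + 4)       # 0x10
OVERRIDE_BIT = 1 << (DPLL_ID * 6)      # 0x01

MASK = MODE_BIT | SSC_BIT | OVERRIDE_BIT   # the three bits that are checked
EXPECTED = OVERRIDE_BIT                    # only the lowest one may be set


def warn_on_fires(val):
    """Return True when the condition inside WARN_ON(...) is true."""
    return (val & MASK) != EXPECTED


if __name__ == "__main__":
    print("mask:", hex(MASK), "expected:", hex(EXPECTED))
    for val in (0x01, 0x00, 0x11, 0x21):
        print(hex(val), "->", "warns" if warn_on_fires(val) else "ok")
```

In other words, the warning fires whenever the register value read at driver load does not have exactly bit 0 set among bits 0, 4 and 5. The trace alone does not show which value was actually read.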
Comment 1 Mark Janes 2018-12-17 16:18:49 UTC
I would disable the GuC.  It has not been properly validated for Mesa.
Comment 2 Lakshmi 2018-12-18 04:48:06 UTC
Valerio, can you attach the full dmesg log from boot with kernel parameters drm.debug=0x1e log_buf_len=4M?
Also, can you verify with kernel 4.20?
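
For reference, a minimal sketch of what these two parameters request. It assumes the category bit values from the kernel's drm.debug parameter documentation (CORE 0x01, DRIVER 0x02, KMS 0x04, PRIME 0x08, ATOMIC 0x10, VBL 0x20) and the usual K/M/G size suffixes; the helper names are hypothetical:

```python
# Bit values assumed from the drm.debug module parameter documentation.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(value):
    """List the debug categories enabled by a drm.debug bitmask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if value & bit]


def parse_log_buf_len(text):
    """Convert a size such as '4M' to bytes (K, M and G are binary multiples)."""
    suffixes = {"K": 1 << 10, "M": 1 << 20, "G": 1 << 30}
    text = text.strip().upper()
    if text and text[-1] in suffixes:
        return int(text[:-1]) * suffixes[text[-1]]
    return int(text)


if __name__ == "__main__":
    print(decode_drm_debug(0x1E))
    print(parse_log_buf_len("4M"))
```

So 0x1e enables every category in that table except CORE and VBL, which is why the log grows so quickly.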
Comment 3 Valerio Vanni 2018-12-18 13:58:15 UTC
(In reply to Lakshmi from comment #2)
> Valerio, can you attach the full dmesg log from boot with kernel parameters
> drm.debug=0x1e log_buf_len=4M?
> Also, can you verify with Kernel 4.20?

I've tried, but it fills the buffer with [drm:] entries and I lose all the early boot ones.
I've tried increasing log_buf_len, but the result was no better.
Comment 4 Valerio Vanni 2018-12-18 15:14:47 UTC
Created attachment 142848 [details]
dmesg with drm.debug=0x1e (working state)

This is a dmesg of the working (no error) state. I'll also post one when the error occurs.
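
Telling a "working" log from a "warning" one can be automated. A minimal sketch, assuming the log has been saved as text; the regular expression is modelled on the WARNING line in the description, and the function name is hypothetical:

```python
import re

# Pattern modelled on the "WARNING: CPU: ... PID: ... at file:line func+off/len" line.
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def find_warnings(dmesg_text):
    """Return (source file, line number, function) for each kernel WARNING found."""
    return [
        (m["file"], int(m["line"]), m["func"])
        for m in WARNING_RE.finditer(dmesg_text)
    ]


# Sample line copied from the description of this bug.
SAMPLE = (
    "[   26.941236] WARNING: CPU: 1 PID: 1150 at "
    "drivers/gpu/drm/i915/intel_cdclk.c:835 skl_get_cdclk+0x211/0x250 [i915]\n"
)

if __name__ == "__main__":
    for source, line, func in find_warnings(SAMPLE):
        print(source, line, func)
```

Running it over each boot's saved dmesg gives a quick yes/no on whether that boot hit the intel_cdclk.c warning.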
Comment 5 Valerio Vanni 2018-12-21 17:30:04 UTC
Created attachment 142873 [details]
dmesg with drm.debug=0x1e (warning state)
Comment 6 Valerio Vanni 2018-12-21 17:32:37 UTC
Comment on attachment 142873 [details]
dmesg with drm.debug=0x1e (warning state)

Today I got the warning
Comment 7 Valerio Vanni 2019-01-02 02:43:42 UTC
Created attachment 142938 [details]
dmesg with drm.debug=0x1e (warning state - kernel 4.20)

And now I got the warning with kernel 4.20
Comment 8 Valerio Vanni 2019-01-21 09:44:02 UTC
Created attachment 143173 [details]
warning with 4.20.3 -
Comment 9 Lakshmi 2019-02-07 12:07:58 UTC
(In reply to Valerio Vanni from comment #8)
> Created attachment 143173 [details]
> warning with 4.20.3 -

Can you try the suggestion in comment 1 and check if you still see the warning?
Comment 10 Lakshmi 2019-02-22 10:13:26 UTC
Valerio, any updates?
Comment 11 Valerio Vanni 2019-02-25 00:42:11 UTC
I haven't seen the warning for some time now.

I can't tie this change to anything. I'm now using kernel 4.20.3, but the warning only disappeared later (I mean: it also happened with this kernel).

I didn't change hardware.
Debian Stretch received many updates; that is the only change I can see.


The GuC has always been active.
Comment 12 Valerio Vanni 2019-03-03 20:40:32 UTC
It just happened.

Now I'm disabling the GuC.
Comment 13 Valerio Vanni 2019-03-06 22:55:59 UTC
And now it happened even with the GuC disabled.
Comment 14 Mark Janes 2019-03-07 17:18:11 UTC
Valerio, I couldn't find the motherboard you describe.  Do you mean the B360M-A?

  https://www.newegg.com/Product/Product.aspx?Item=N82E16813119086

Things that I would try:

 - eliminate hardware as a root cause
   - check for bios update
   - verify that RAM passes memcheck

What is it about the error message that causes you to think this is a graphics bug?  The stack trace makes me think there is something wrong with power management.  The symptoms (won't wake up) seem to be similar too, although from your description, it doesn't sound like the machine is asleep.
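
The frames behind that reading can be pulled out of the call trace mechanically. A minimal sketch, using the first few frames from the description as sample input; the regular expression and function name are hypothetical, and frames prefixed with "?" are treated as unreliable guesses by the unwinder:

```python
import re

# Matches "[ time ]  [? ]func+0xoff/0xlen [module]" lines from a kernel call trace.
FRAME_RE = re.compile(
    r"\]\s+(?P<guess>\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(lines):
    """Return (function, module or None, reliable) for each call-trace line."""
    frames = []
    for line in lines:
        m = FRAME_RE.search(line)
        if m:
            frames.append((m.group("func"), m.group("module"), m.group("guess") is None))
    return frames


# Frames copied from the trace in the description.
SAMPLE = [
    "[   29.275370]  intel_update_cdclk+0x17/0x60 [i915]",
    "[   29.314803]  skl_init_cdclk+0x3d/0x1b0 [i915]",
    "[   29.351037]  ? intel_power_well_enable+0x2f/0x40 [i915]",
    "[   29.397601]  intel_power_domains_init_hw+0x771/0x990 [i915]",
    "[   29.448292]  i915_driver_load+0x94a/0xf50 [i915]",
    "[   29.487614]  pci_device_probe+0xa1/0x130",
]

if __name__ == "__main__":
    for func, module, reliable in parse_frames(SAMPLE):
        print(func, module or "(core kernel)", "" if reliable else "(unreliable)")
```

The i915 frames show the clock readout being reached from intel_power_domains_init_hw during driver load, that is, from the driver's display power-domain setup.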
Comment 15 Valerio Vanni 2019-03-07 19:03:34 UTC
>Valerio, I couldn't find the motherboard you describe.  Do you mean the B360M-A?
>
>  https://www.newegg.com/Product/Product.aspx?Item=N82E16813119086

Yes, I typed it wrong.


>Things that I would try:
>
> - eliminate hardware as a root cause
>   - check for bios update

BIOS is 1602, updated in November. I now see that there's a newer version; I'll try it immediately.

>   - verify that RAM passes memcheck

I'll try this too. I ran it after I built the machine and it was OK.

>What is it about the error message that causes you to think this is a graphics bug?  The stack trace makes me think there is something wrong with power management.

I suspected it from the first line
WARNING: CPU: 1 PID: 1150 at drivers/gpu/drm/i915/intel_cdclk.c:835 skl_get_cdclk+0x211/0x250 [i915]

Where in the stack trace do you see a power management issue?
I'm not very good at analyzing stack traces; as I said, I focused on the first lines.


>The symptoms (won't wake up) seem to be similar too, although from your description, it doesn't sound like the machine is asleep.

But perhaps they are not related.

Sometimes (but even less frequently) the machine starts up (power LED on) but with a blank screen, no response to ping, etc.
I can't say whether it stops in the BIOS or during Linux boot, because it usually happens while I'm working on another machine through a KVM switch: I power this one on, some time later I point the KVM switch at it, and I find it dead.

I have to fix my serial port bracket, because it has the wrong pinout for this motherboard. With a serial console I might see something, if the Linux boot has at least started.

When I see this warning, instead, the machine is alive and (seems to be) working.
Comment 16 Lakshmi 2019-03-27 10:57:34 UTC
(In reply to Valerio Vanni from comment #15)
Any further updates here?
Comment 17 Valerio Vanni 2019-04-08 21:11:43 UTC
Soon after my last message, I updated the BIOS to version 2012.
Since then, the issue has not occurred again.

The issue happened only after a reboot or shutdown, never after a suspend (S3) and resume.
During all this time I've always done full reboots, to test better with more triggering events.

It seems that the BIOS update fixed the issue.
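
When comparing logs from before and after a firmware update, the BIOS version can be read from the "Hardware name:" line that the kernel prints with every warning. A minimal sketch, using the line from the description as sample input; the regular expression and function name are hypothetical:

```python
import re

# Modelled on: "Hardware name: <vendor> <product>/<board>, BIOS <version> <date>"
HARDWARE_RE = re.compile(
    r"Hardware name:\s+(?P<system>.+?),\s+BIOS\s+(?P<version>\S+)\s+"
    r"(?P<date>\d{2}/\d{2}/\d{4})"
)


def parse_hardware_line(line):
    """Return (board, BIOS version, BIOS date), or None if the line does not match."""
    m = HARDWARE_RE.search(line)
    if m is None:
        return None
    board = m.group("system").split("/")[-1]
    return board, m.group("version"), m.group("date")


# Line copied from the trace in the description (BIOS before the update).
SAMPLE = (
    "[   27.728740] Hardware name: System manufacturer System Product Name/"
    "PRIME B360M-A, BIOS 1602 10/10/2018"
)

if __name__ == "__main__":
    print(parse_hardware_line(SAMPLE))
```

A log captured after the update should report version 2012 in the same field.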
Comment 18 Lakshmi 2019-05-02 12:11:05 UTC
(In reply to Valerio Vanni from comment #17)
Thanks for the feedback. Closing this bug as WORKSFORME.

