Bug 105962

Summary:

[KBL] "enable_rc6" parameter deprecation brings back freezing

Product:

DRI

Reporter:

Filip <tyx027>

Component:

DRM/Intel

Assignee:

Anshuman Gupta <anshuman.gupta>

Status:

RESOLVED MOVED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

normal

Priority:

high

CC:

anshuman.gupta, bbrandl.atoss+freedesktop, freedesktopbz, freedesktop.org, imre.deak, intel-gfx-bugs, jussi, lakshminarayana.vudum, mapengyu, mika.kuoppala, timo.teras, tomi.p.sarvela

Version:

XOrg git

Hardware:

All

OS:

Linux (All)

Whiteboard:

Triaged, ReadyForDev

i915 platform:

KBL

i915 features:

power/Other

Attachments:

Description	Flags
Dmesg output with drm.debug=0xe	none
Journalctl 5 minutes before the freeze	none
Logfile journalctl, freeze without disabling RC6	none
various dmesg w/ drm.debug=0xe on dell laptop	none

Description Filip 2018-04-10 00:11:00 UTC

Hello,

I noticed that the "enable_rc6" parameter is gone since kernel 4.16 and found out the reason is that there aren't any bugs related to it anymore.

Unfortunately this is not true. Lenovo's V310 and V510 laptops massively suffer from random freezing and so far the only fully working solution has been to set the mentioned parameter to 0. The "enable_dc=0" parameter alone is not enough as the freezing is back since kernel 4.16. 

Is there any way of passing an equivalent parameter or any other way to turn off RC6? If not, on behalf of all owners of the laptops mentioned above I kindly beg the developers to consider the non-negligible effect this deprecation will have on us.

More on these Lenovo laptops freezing: https://forums.lenovo.com/t5/Lenovo-C-E-K-M-N-and-V-Series/V510-15IKB-Laptop-Freeze/td-p/3577112

Best regards,
Filip

Comment 1 Jani Saarinen 2018-04-10 05:30:21 UTC

Imre, any comments?

Comment 2 Chris Wilson 2018-04-10 08:32:37 UTC

It wasn't rc6 you wanted but the side-effect of disabling powersaving.

Comment 3 Imre Deak 2018-04-10 11:16:46 UTC

Did you try if limiting CPU C states also gets rid of the problem (leaving graphics power saving enabled)? If using intel_idle you can boot with the

intel_idle.max_cstate=1

kernel parameter to do this.

Could you provide a dmesg log booting with drm.debug=0xe?

Did you try to collect logs after the freeze?:
- via ssh if it still works
- net or serial console if the machine has a serial/ethernet plug (for netconsole https://wiki.archlinux.org/index.php/Netconsole)
- using pstore, having EFI or RAM based pstore, hard and soft lockup detection enabled in your kconfig and booting with "nmi_watchdog=panic panic=5" kernel parameters.
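
Several of the suggestions above depend on kernel parameters actually reaching the running kernel. A minimal Python sketch for checking that, assuming the text of /proc/cmdline is passed in as a string. The helper names parse_cmdline and missing_params are hypothetical, and the wanted values are simply the ones quoted in this comment:

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in shlex.split(cmdline):
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_params(cmdline, wanted):
    """Return the wanted 'name=value' pairs that the command line does not carry."""
    have = parse_cmdline(cmdline)
    return [f"{name}={value}" for name, value in wanted.items()
            if have.get(name) != value]


# Values taken from the suggestions in this comment (illustrative selection).
WANTED = {
    "intel_idle.max_cstate": "1",
    "drm.debug": "0xe",
    "nmi_watchdog": "panic",
    "panic": "5",
}
```

On a running system, missing_params(open("/proc/cmdline").read(), WANTED) would list whichever of these parameters did not make it onto the command line; an empty list means all of them are present with the expected values.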

Comment 4 Filip 2018-04-10 12:25:27 UTC

I am using intel_idle (checked with 'cat /sys/devices/system/cpu/cpuidle/current_driver') so I just turned off power saving and limited the C states and will keep you informed on the results.

I'll attach a dmesg output as well and see what I can do about collecting the logs after freezes. Another user in the Lenovo thread mentioned they have already been collecting them via netconsole so I will also ask them to post the logs here if possible.

Comment 5 Filip 2018-04-10 12:26:41 UTC

Created attachment 138729
Dmesg output with drm.debug=0xe

Comment 6 Filip 2018-04-10 12:34:47 UTC

Comment on attachment 138729
Dmesg output with drm.debug=0xe

Note: this is just after boot, not after the freeze event

Comment 7 Filip 2018-04-11 09:59:40 UTC

Unfortunately the other user says there is zero output with netconsole when the freezes occur, but he has added the debug option and will see if something happens.

As for me, I haven't had a freeze so far, but will keep testing since the freezes can happen multiple times a day, but also only once in a few days. I reckon if they don't occur after a week or two it would be confirmation enough that limiting c-states is a workaround.

In the meantime I would like to provide a short summary of what the issue with these laptops has been. I apologize if this is spammy in any way and please ignore if it is, but I realized the thread I linked is too long so maybe this can be helpful.

The freezes that we refer to are random in nature and total in their effect - meaning physical power-off is necessary. They happen both in Windows and Linux. Even though the thread is for the Kaby Lake V510 model, IIRC there have been freezes with the V310 series as well, and the Skylake version was not exempt. The last time I counted, some 30-ish users had reported this issue, but the confirmed count is much higher since some IT personnel reported freezes on their whole batches of acquired laptops. We believe the issue has something to do with Intel power-saving, but it's quite unclear if this is caused by a driver issue or is a result of bad Lenovo BIOS or motherboard. Lenovo has been unresponsive, while their service centers have usually been replacing the motherboards, which is a solution that helped only one user so far. Windows hacks that worked for some (but not all) users: https://forums.lenovo.com/t5/Lenovo-C-E-K-M-N-and-V-Series/V510-15IKB-Laptop-Freeze/m-p/3852313#M24549. And the Linux hack that worked involved disabling DC and RC6. And oh yes, there was also a previous one that involved turning off DRI, but that came with heavy side-effects.

Comment 8 Filip 2018-04-13 11:04:33 UTC

Update: a freeze did unfortunately occur with c-states limited. The other user from the forum also mentioned he tested this before and had the same outcome. 

Journalctl doesn't show anything out of the ordinary. i915 was just switching DC states from 00 to 02 and vice versa, the last one it switched to being 00. What may be interesting is that i915 had been quiet for 14 seconds prior to the freeze, while it usually does something every two seconds. This also happened before in the session, however, and with no freeze. Due to some obstacles, I had not gotten to setting up something to obtain a log while the machine is frozen, but I will see what I can do.

The other user's comment on logging, however, is: "You will don't find any logs related that freeze. Even not with kernels netconsole or any debugging parameters.  I've spend many time to that issue and find nothing"

Comment 9 Filip 2018-04-13 11:05:21 UTC

Created attachment 138823
Journalctl 5 minutes before the freeze

Comment 10 Jani Saarinen 2018-04-25 11:52:20 UTC

Imre, any advice to proceed here?

Comment 11 Imre Deak 2018-04-25 14:11:16 UTC

(In reply to Filip from comment #8)
> Update: a freeze did unfortunately occur with c-states limited. The other
> user from the forum also mentioned he tested this before and had the same
> outcome. 

Ok, thanks for trying.

> Journalctl doesn't show anything out of the ordinary. i915 was just
> switching DC states from 00 to 02 and vice versa, the last one it switched
> to being 00. What may be interesting is that i915 had been quiet for 14
> seconds prior to the freeze, while it usually does something every two
> seconds. This also happened before in the session, however, and with no
> freeze.

Ok, as I understood you already tried booting with i915.enable_dc=0 and that didn't get rid of the problem.

Could you confirm that all display outputs were off when the freeze happened?

Do you see any other pattern in what you do before the freeze?

I'm guessing the DC state toggling is due to GPU activity, probably due to updating the clock in your GUI. Could you try preventing these updates (and any other GPU activity) for instance by switching away to another VT from your GUI and seeing if the freeze still happens? Please also provide a dmesg log booting with drm.debug=0x1f up to the freeze to double-check what causes the DC state toggling.

Could you try if booting with nomodeset the freeze still happens?

> Due to some obstacles, I had not gotten to setting up something to
> obtain a log while the machine is frozen, but I will see what I can do.
> 
> The other user's comment on logging, however, is: "You will don't find any
> logs related that freeze. Even not with kernels netconsole or any debugging
> parameters.  I've spend many time to that issue and find nothing"

Ok, please still try if the pstore method provides something.

Thanks.

Comment 12 Filip 2018-04-26 11:54:47 UTC

(In reply to Imre Deak from comment #11)

> Ok, as I understood you already tried booting with i915.enable_dc=0 and that
> didn't get rid of the problem.

Yes, rc6 needs to be turned off as well.

> Could you confirm that all display outputs were off when the freeze happened?

How can I check this?

> Do you see any other pattern in what you do before the freeze?

No, unfortunately that is the thing with these freezes - they are completely random and cannot be straightforwardly reproduced. A stress test e.g. won't help. From everything that has been written on the forum, they do however seem to happen more often when the GPU is doing work.

> Could you try preventing these updates (and any other GPU activity) for
> instance by switching away to another VT from your GUI and seeing if the
> freeze still happens?
 
How do I go about doing this?

> Please also provide a dmesg log booting with drm.debug=0x1f up to the freeze
> to double-check what causes the DC state toggling.

> Could you try if booting with nomodeset the freeze still happens?

> Ok, please still try if the pstore method provides something.

These I mostly understand how to do, except the pstore method, but there may be a guide somewhere. Unfortunately I've had to go back to kernel 4.14 and disabling rc6 due to working on essays for uni deadlines so I will try all this as soon as I'm in the clear, but will also ask again that the other Linux users from the forum contribute here if they can.

Comment 13 paulz 2018-04-26 13:02:43 UTC

(In reply to Imre Deak from comment #11)
> (In reply to Filip from comment #8)
> > Update: a freeze did unfortunately occur with c-states limited. The other
> > user from the forum also mentioned he tested this before and had the same
> > outcome. 
> 
> Ok, thanks for trying.
disabling c-states does not help


> Ok, as I understood you already tried booting with i915.enable_dc=0 and that
> didn't get rid of the problem.
yes, RC6 has to be disabled

 
> Could you confirm that all display outputs were off when the freeze happened?
the screens are not off but frozen. After a long time the screens go black, if i remember correctly


> I'm guessing the DC state toggling is due to GPU activity, probably due to
> updating the clock in your GUI. Could you try preventing these updates (and
> any other GPU activity) for instance by switching away to another VT from
> your GUI and seeing if the freeze still happens?
switching away to another VT is NOT possible, it's the whole PC that freezes!
Even SysRq doesn't work, the keyboard is also dead


> Please also provide a dmesg
> log booting with drm.debug=0x1f up to the freeze to double-check what causes
> the DC state toggling.
i will do that.


> Could you try if booting with nomodeset the freeze still happens?
i will give it a try


> Ok, please still try if the pstore method provides something.
i will give it a try

Comment 14 paulz 2018-04-26 13:24:49 UTC

(In reply to paulz from comment #13)

> > Could you try if booting with nomodeset the freeze still happens?
> i will give it a try

with nomodeset i can't log in through Gnome Desktop Manager. Other VTs work, but i need a graphical environment, so i removed that option again, sorry

Comment 15 paulz 2018-04-26 14:13:34 UTC

Created attachment 139135
Logfile journalctl, freeze without disabling RC6

Comment 16 paulz 2018-04-26 14:15:02 UTC

(In reply to paulz from comment #13)
> > Please also provide a dmesg
> > log booting with drm.debug=0x1f up to the freeze to double-check what causes
> > the DC state toggling.
> i will do that.

freeze after less than 30 minutes without disabling RC6.

logfile: https://bugs.freedesktop.org/attachment.cgi?id=139135

Comment 17 Imre Deak 2018-04-26 14:33:58 UTC

(In reply to paulz from comment #13)
> [...]
> > I'm guessing the DC state toggling is due to GPU activity, probably due to
> > updating the clock in your GUI. Could you try preventing these updates (and
> > any other GPU activity) for instance by switching away to another VT from
> > your GUI and seeing if the freeze still happens?
> switching away to another VT is NOT possible, it's the whole PC that freezes!
> Even SysRq doesn't work, the keyboard is also dead

I meant here to switch to another VT from the GUI before the freeze to avoid any GPU activity (it looks like it is the periodic clock update based on your later logs) and see if the freeze still happens.

Comment 18 Imre Deak 2018-04-26 14:35:15 UTC

(In reply to paulz from comment #14)
> (In reply to paulz from comment #13)
> 
> > > Could you try if booting with nomodeset the freeze still happens?
> > i will give it a try
> 
> with nomodeset i can't log in through Gnome Desktop Manager. Other VTs work,
> but i need a graphical environment, so i removed that option again, sorry

Here again the idea would be to see if without the i915 driver loaded the machine still freezes.

Comment 19 Imre Deak 2018-04-26 14:51:13 UTC

(In reply to paulz from comment #16)
> (In reply to paulz from comment #13)
> > > Please also provide a dmesg
> > > log booting with drm.debug=0x1f up to the freeze to double-check what causes
> > > the DC state toggling.
> > i will do that.
> 
> freeze after less than 30 minutes without disabling RC6.
> 
> logfile: https://bugs.freedesktop.org/attachment.cgi?id=139135

Thanks, looks like the only activity preceding the freeze is some periodic GPU command, I suppose to update the clock in GUI, but nothing out of ordinary. You could still check if enabling pstore would provide additional logs after freeze and reboot. For that you'd need to build your kernel with EFI or RAM based PSTORE support (for EFI: CONFIG_PSTORE=y, CONFIG_EFI_VARS_PSTORE=y) and boot with the 'nmi_watchdog=panic panic=5' kernel params. After freeze/rebooting

# mount -t pstore none <dir>

should put any such logs in <dir>.
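
After the reboot, checking the mount point for records can be scripted. A small Python sketch, assuming pstore is mounted at /sys/fs/pstore (pass the <dir> used in the mount command otherwise); list_pstore_records is a hypothetical helper name:

```python
from pathlib import Path


def list_pstore_records(directory="/sys/fs/pstore"):
    """Return the sorted names of regular files found in the pstore directory.

    An empty list means either that nothing was saved or that the directory
    does not exist (for example because pstore is not mounted there).
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(entry.name for entry in root.iterdir() if entry.is_file())
```

An empty result after a freeze and reboot suggests that no log reached persistent storage before the machine hung.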

Comment 20 Arthur 2018-04-26 19:18:41 UTC

Hi all, at the risk of flooding this thread with "me too"'s I'd like to add some information that might be useful. Some background: I have an X1 Carbon 5th Gen with an i5-7200U and Intel Graphics 620 and am experiencing a problem very similar to Filip's. The only difference is that my computer does not freeze, it simply shuts down suddenly and violently as if someone disconnected the battery.

Like Filip and others the solution was to set i915.enable_rc6=0. I have tried all other solutions presented in this thread to no avail.

I have tried capturing something (**anything**) from these crashes with no luck. I tried using Kdump and a PSTORE EFI method as suggested by Imre. Nothing works.

Now the potentially useful info:

o I don't use a DM, so I have to start X manually. My computer runs perfectly if I don't start an X session. Absolutely no problems under all operating conditions.

o As Imre recommended I tried booting with mode setting turned off. This "worked". I put that in quotes because the graphical capabilities were very reduced. The only driver that was able to run was xf86-video-fbdev and even then X was, for example, not aware of any monitor settings and could not support an external monitor. That said, I WAS able to fire up my WM and use my computer in this reduced state with no crashes.

o I am currently running Arch Linux, but I confirmed that the crashes still happen when using an Ubuntu Live USB. The Ubuntu I tested has linux 4.13.

I hope this information can be useful; please let me know if you need any more information. I am watching this thread eagerly for a solution.

Comment 21 paulz 2018-04-27 07:54:39 UTC

(In reply to Imre Deak from comment #19)
> Thanks, looks like the only activity preceding the freeze is some periodic
> GPU command, I suppose to update the clock in GUI, but nothing out of
> ordinary. You could still check if enabling pstore would provide additional
> logs after freeze and reboot. For that you'd need to build your kernel with
> EFI or RAM based PSTORE support (for EFI: CONFIG_PSTORE=y,
> CONFIG_EFI_VARS_PSTORE=y) and boot with the 'nmi_watchdog=panic panic=5'
> kernel params. After freeze/rebooting
> 
> # mount -t pstore none <dir>
> 
> should put any such logs in <dir>.

paul@behemoth:~$ grep CONFIG_PSTORE /boot/config-4.13.0-39-generic
CONFIG_PSTORE=y

paul@behemoth:~$ grep CONFIG_EFI_VARS_PSTORE /boot/config-4.13.0-39-generic
CONFIG_EFI_VARS_PSTORE=m

i guess i have to rebuild the kernel because CONFIG_EFI_VARS_PSTORE is not "y"?
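
The two grep calls can be folded into one check. A minimal Python sketch, assuming the text of the config file has already been read into a string; config_value is a hypothetical helper name. In kernel config files "y" means built in, "m" means built as a loadable module, and a "# ... is not set" comment means disabled:

```python
def config_value(config_text, option):
    """Look up a kernel config option.

    Returns the value after '=' (for example 'y' or 'm'), 'n' if the option is
    explicitly marked as not set, or None if the option is not mentioned.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
        if line == f"# {option} is not set":
            return "n"
    return None
```

Applied to the two lines shown above, this returns "y" for CONFIG_PSTORE and "m" for CONFIG_EFI_VARS_PSTORE, i.e. the EFI backend is available as a module rather than built in.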

Comment 22 paulz 2018-04-30 09:28:55 UTC

i've set nmi_watchdog=panic and panic=5 and waiting for freeze.

i also noticed that pstore is already mounted but /sys/fs/pstore is always empty and the kernel doesn't reboot after 5 seconds.

paul@behemoth:~$ mount -l
pstore on /sys/fs/pstore type pstore (rw,nosuid,nodev,noexec,relatime)

Comment 23 Jani Saarinen 2018-04-30 09:40:11 UTC

Tomi, any help here?

Comment 24 Tomi Sarvela 2018-05-04 15:04:11 UTC

If you see /sys/fs/pstore, then the feature is enabled and working.

Next thing is to configure linux to write dmesg buffer to pstore when panicking, and the important thing is to panic. Add to command line:

nmi_watchdog=panic,auto panic=5 softdog.soft_panic=5

(5 on the command line means 5 seconds)

Unfortunately, in many suspend/hang issues nothing gets written to pstore because CPU is not available. Pstore is a last resort thing, a bit like console: it might help, or it might stay silent.

Comment 25 Jani Saarinen 2018-05-07 10:37:20 UTC

Reporter, were you able to get this working?

Comment 26 paulz 2018-05-07 11:27:03 UTC

(In reply to Tomi Sarvela from comment #24)
> If you see /sys/fs/pstore, then the feature is enabled and working.
> 
> Next thing is to configure linux to write dmesg buffer to pstore when
> panicking, and the important thing is to panic. Add to command line:
> 
> nmi_watchdog=panic,auto panic=5 softdog.soft_panic=5
> 
> (5 on the command line means 5 seconds)
> 
> Unfortunately, in many suspend/hang issues nothing gets written to pstore
> because CPU is not available. Pstore is a last resort thing, a bit like
> console: it might help, or it might stay silent.

No it's not, EFI store is just mounted!

We also have to load the module "efi_pstore" (CONFIG_PSTORE_RAM=m) and maybe "ramoops" so i added it to /etc/modules!

Now there are dmesg-efi-* files in /sys/fs/pstore/ if i trigger a kernel crash!

now i am waiting for the next freeze without disabling "rc6" (boot options: debug ignore_loglevel drm.debug=0x1f nmi_watchdog=panic,auto panic=5 softdog.soft_panic=5)

Comment 27 paulz 2018-05-07 11:55:45 UTC

Update: a freeze happened without any logs in pstore and also no reboot after 5 seconds (panic=5). That only works when i trigger the crash manually!

I also want to note that even with i915 rc6 disabled the notebook is not very stable, so one crash per week can happen! But without disabling rc6 it happens after a few minutes!

Comment 28 paulz 2018-05-11 07:35:55 UTC

(In reply to Imre Deak from comment #17)
> (In reply to paulz from comment #13)
> > [...]
> > > I'm guessing the DC state toggling is due to GPU activity, probably due to
> > > updating the clock in your GUI. Could you try preventing these updates (and
> > > any other GPU activity) for instance by switching away to another VT from
> > > your GUI and seeing if the freeze still happens?
> > switching away to another VT is NOT possible, it's the whole PC that freezes!
> > Even SysRq doesn't work, the keyboard is also dead
> 
> I meant here to switch to another VT from the GUI before the freeze to avoid
> any GPU activity (it looks like it is the periodic clock update based on
> your later logs) and see if the freeze still happens.

confirmed! no freeze on another VT (bash shell) for >36 hours without setting i915.enable_rc6=0

Comment 29 paulz 2018-05-15 09:20:11 UTC

(In reply to Imre Deak from comment #11)
> I'm guessing the DC state toggling is due to GPU activity, probably due to
> updating the clock in your GUI. Could you try preventing these updates (and
> any other GPU activity) for instance by switching away to another VT from
> your GUI and seeing if the freeze still happens? Please also provide a dmesg
> log booting with drm.debug=0x1f up to the freeze to double-check what causes
> the DC state toggling.

i've removed the option to disable i915 rc6 and instead added a udev rule to lock the GPU frequency and boost (!) frequency to RP1.

ACTION=="add", KERNEL=="card0", SUBSYSTEM=="drm", ATTR{gt_max_freq_mhz}="300", ATTR{gt_boost_freq_mhz}="300"

Until now no Freeze!! (since yesterday keeping power on)
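
Whether the rule took effect can be read back from sysfs. A minimal Python sketch, assuming the i915 frequency attributes are exposed under /sys/class/drm/card0 (the card index may differ per system); read_gt_freqs and is_pinned are hypothetical helper names:

```python
from pathlib import Path

# i915 sysfs attributes of interest; the udev rule above writes two of them.
GT_ATTRS = ("gt_min_freq_mhz", "gt_max_freq_mhz",
            "gt_boost_freq_mhz", "gt_cur_freq_mhz")


def read_gt_freqs(card_dir="/sys/class/drm/card0"):
    """Return {attribute: MHz} for every listed attribute present in card_dir."""
    freqs = {}
    for name in GT_ATTRS:
        path = Path(card_dir) / name
        if path.is_file():
            freqs[name] = int(path.read_text().strip())
    return freqs


def is_pinned(freqs, mhz):
    """True if both the max and the boost frequency equal the given value."""
    return all(freqs.get(name) == mhz
               for name in ("gt_max_freq_mhz", "gt_boost_freq_mhz"))
```

With the rule above applied, is_pinned(read_gt_freqs(), 300) would be expected to return True; if it does not, the rule was probably not picked up.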

Comment 30 Arthur 2018-05-15 20:07:14 UTC

I tried paulz's UDEV fix and I'm still getting crashes. I confirmed that the changes in max frequency were correctly applied, but it didn't fix my problem. The result was the same using kernel 4.15.15 (without i915_enable_rc6=0) or kernel 4.16.8 (which doesn't recognize the rc6 parameter anyway).

I also tried setting the min and max frequency to an even lower number (200 MHz), but there seems to be a hard lower limit of 300 MHz (maybe because that's what RP1_freq is set to and this parameter seems un-modifiable).

Comment 31 paulz 2018-05-16 07:52:33 UTC

(In reply to Arthur from comment #30)
> I tried paulz's UDEV fix and I'm still getting crashes. I confirmed that the
> changes in max frequency were correctly applied, but it didn't fix my
> problem. The result was the same using kernel 4.15.15 (without
> i915_enable_rc6=0) or kernel 4.16.8 (which doesn't recognize the rc6
> parameter anyway).
> 
> I also tried setting the min and max frequency to an even lower number (200
> MHz), but there seems to be a hard lower limit of 300 MHz (maybe because
> that's what RP1_freq is set to and this parameter seems un-modifiable).

"cur", "min" and "max" have to be equal!

paul@behemoth:~$ sudo intel_gpu_frequency -g
cur: 300 MHz
min: 300 MHz
RP1: 300 MHz
max: 300 MHz

PS: up 1 day, 16:42 (without i915_enable_rc6=0)
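
The "cur, min and max have to be equal" condition can be checked mechanically. A minimal Python sketch, assuming the output of intel_gpu_frequency -g has the "name: value MHz" shape shown above and is passed in as text; parse_gpu_frequency and is_locked are hypothetical helper names:

```python
import re


def parse_gpu_frequency(text):
    """Parse lines such as 'cur: 300 MHz' into {'cur': 300, ...}."""
    freqs = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+):\s*(\d+)\s*MHz", line)
        if match:
            freqs[match.group(1)] = int(match.group(2))
    return freqs


def is_locked(freqs):
    """True only if cur, min and max are all present and equal."""
    return ("cur" in freqs
            and freqs.get("cur") == freqs.get("min") == freqs.get("max"))
```

For the output above is_locked returns True; for a state where "cur" stays at 1050 MHz while "min" and "max" are 300 MHz it returns False.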

Comment 32 Arthur 2018-05-16 15:49:14 UTC

(In reply to paulz from comment #31)
> "cur", "min" and "max" have to be equal!
> 
> paul@behemoth:~$ sudo intel_gpu_frequency -g
> cur: 300 MHz
> min: 300 MHz
> RP1: 300 MHz
> max: 300 MHz

Yes, I confirmed that all of these were equal and set to 300 MHz. In addition to the UDEV rules I also tried limiting the frequency with the intel_gpu_frequency tool:

$ sudo intel_gpu_frequency -i

Both this and the UDEV rules resulted in, as paulz stated, "cur", "min", and "max" being equal to 300 MHz. Unfortunately, even in this state I still get the same crashes.

Comment 33 paulz 2018-05-17 07:51:14 UTC

(In reply to Arthur from comment #32)
> In addition to the UDEV rules I also tried limiting the frequency with the
> intel_gpu_frequency tool:
> 
> $ sudo intel_gpu_frequency -i
> 
> Both this and the UDEV rules resulted in, as paulz stated, "cur", "min", and
> "max" being equal to 300 MHz. Unfortunately, even in this state I still get
> the same crashes.

Interesting. In my case the intel_gpu_frequency tool sets "min" and "max" to RP1 but keeps "cur" at 1050MHz and the freeze still occurs, because the GPU boost frequency is not changed, so i did it with UDEV. Using i7-7500U (HD Graphics 620).

PS: up 2 days, 16:38 (without i915_enable_rc6=0)

Comment 34 Arthur 2018-05-17 18:55:37 UTC

(In reply to paulz from comment #33)
> Interesting. In my case the intel_gpu_frequency tool sets "min" and "max" to
> RP1 but keeps "cur" at 1050MHz and the freeze still occurs, because the GPU
> boost frequency is not changed, so i did it with UDEV. Using i7-7500U (HD
> Graphics 620).
> 
> PS: up 2 days, 16:38 (without i915_enable_rc6=0)

I'm not sure that it matters, but my default max frequency is 1000 MHz. I'm on a i5-7200U with Intel Graphics 620.

Comment 35 paulz 2018-05-22 07:41:32 UTC

first of all: up 7 days, 16:03 (without i915_enable_rc6=0)
Now i rebooted by myself ;)

Another question:

systemd requested 14250 boosts in 12min! Is that normal? maybe the reason for instability?

paul@behemoth:~$ uptime
 09:33:45 up 12 min,  2 users,  load average: 0,61, 0,79, 0,55

paul@behemoth:~$ sudo cat /sys/kernel/debug/dri/0/i915_rps_boost_info
RPS enabled? 1
GPU busy? yes [3 requests]
CPU waiting? 0
Boosts outstanding? 1
Frequency requested 1050
  min hard:300, soft:1050; max soft:1050, hard:1050
  idle:300, efficient:300, boost:1050
systemd-logind [861]: 108 boosts
Xorg [1549]: 10 boosts
systemd-logind [861]: 14250 boosts
Xorg [2399]: 0 boosts
Xorg [2399]: 86 boosts
Xorg [2399]: 5 boosts
Xorg [2399]: 0 boosts
Kernel (anonymous) boosts: 0

RPS Autotuning (current "high power" window):
  Avg. up: 100% [above threshold? 85%]
  Avg. down: 46% [below threshold? 60%]
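
Since the same client shows up on several lines, it helps to total the boosts per process. A minimal Python sketch, assuming the "name [pid]: N boosts" line format shown above and that the debugfs text has been read into a string; boosts_by_client is a hypothetical helper name:

```python
import re
from collections import Counter


def boosts_by_client(text):
    """Sum the boost counts per (process name, pid) from i915_rps_boost_info."""
    totals = Counter()
    for line in text.splitlines():
        match = re.match(r"\s*(.+?) \[(\d+)\]: (\d+) boosts", line)
        if match:
            totals[(match.group(1), int(match.group(2)))] += int(match.group(3))
    return totals
```

For the listing above this gives 14358 boosts for systemd-logind [861] and 91 for Xorg [2399]. The single 14250 entry over 12 minutes of uptime works out to roughly 20 boosts per second.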

Comment 36 paulz 2018-05-22 09:44:37 UTC

a freeze happened shortly after setting "gt_min_freq_mhz" to RP0 (full speed) to keep the GPU at max speed instead of setting "gt_max_freq_mhz" and "gt_boost_freq_mhz" to RP1 (most efficient, low speed)!

ACTION=="add", KERNEL=="card0", SUBSYSTEM=="drm", ATTR{gt_min_freq_mhz}="1050"

Comment 37 Arthur 2018-06-04 14:56:17 UTC

Hello,

Any word/news on this from the kernel side? I have spent days trying to emulate paulz's success through combinations of gpu frequency setting and/or power management rules, but to no avail. Does anyone have any other ideas of where I could direct my efforts?

Thanks!

Comment 38 Filip 2018-07-01 14:48:43 UTC

Is there any chance whatsoever of bringing back the option? 

I've googled around and noticed it has been used as a workaround for a few other cases as well. I understand that some of the underlying issues may not be related to RC6 directly, but it's still helpful to have it as  something to fall back on.

Comment 39 paulz 2018-07-03 08:13:57 UTC

i just want to note that the GPU boosts i described in comment 35 come from my "I-Tec USB 3.0 Docking Station".

Currently i am trying the newest kernel without any "freeze workaround" and without my docking station, and waiting for freezes.

Comment 40 Arthur 2018-08-07 14:04:16 UTC

Hello,

Is there any possibility that we'll either a) get the enable_rc6 parameter back in the kernel, or b) hear anything from the developers?

A quick search for "enable_rc6=0" will reveal a ton of recent bugs of various flavors (from screen flickering to severe crashing) that are "fixed" with this parameter. I know that using this parameter is like using a cruise missile to light a cigarette, but for a lot of people it's the only solution we have. The assumption that led to this parameter being removed is just plain wrong.

Any sort of feedback would be greatly appreciated.

Thank you,
Arthur

Comment 41 Jani Nikula 2018-08-29 10:30:29 UTC

Please try kernel v4.18 or later with i915.dmc_firmware_path="" (or something that doesn't exist) to disable DMC firmware and consequently runtime PM.

Comment 42 paulz 2018-08-29 13:28:39 UTC

(In reply to Jani Nikula from comment #41)
> Please try kernel v4.18 or later with i915.dmc_firmware_path="" (or
> something that doesn't exist) to disable DMC firmware and consequently
> runtime PM.

i did it already, unloaded the DMC blob successfully and my system still froze.
But not with 4.18 - any reason why this version?

Comment 43 Arthur 2018-08-29 16:15:07 UTC

Hi Jani,

Thanks for your response. I tried your suggestion on kernel 4.18.5 and confirmed that DMC firmware was not loaded:

$ dmesg | grep i915
[    0.000000] Command line: BOOT_IMAGE=/vmlinuz-linux root=UUID=XXX rw quiet i915.dmc_firmware_path=
[    0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-linux root=UUID=XXX rw quiet i915.dmc_firmware_path=
[    1.949621] i915 0000:00:02.0: enabling device (0006 -> 0007)
[    1.953115] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=mem
[    1.953356] i915 0000:00:02.0: Failed to load DMC firmware . Disabling runtime power management.
[    1.953359] i915 0000:00:02.0: DMC firmware homepage: https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915
[    1.999251] [drm] Initialized i915 1.6.0 20180514 for 0000:00:02.0 on minor 0
[    2.004006] snd_hda_intel 0000:00:1f.3: bound 0000:00:02.0 (ops i915_audio_component_bind_ops [i915])
[    3.226906] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
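
The same confirmation can be scripted. A minimal Python sketch, assuming the dmesg text is available as a string and that the driver prints the "Failed to load DMC firmware" message seen above (the wording may differ between kernel versions); dmc_firmware_disabled is a hypothetical helper name:

```python
def dmc_firmware_disabled(dmesg_text):
    """Return True if any i915 line reports that DMC firmware failed to load."""
    return any("i915" in line and "Failed to load DMC firmware" in line
               for line in dmesg_text.splitlines())
```

A False result only means the message was not found in the supplied text, for example because the kernel log buffer has already wrapped.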

In this state I observed the following:

o With no peripherals plugged into my laptop I did not experience any immediate crashes (which I would have expected if the problem persisted). This is good news :)

o I then tried attaching a Thunderbolt 3 dock that, among other things, connects an external monitor. Less than a minute later the system crashed. The same thing happened if I booted the computer with the dock already connected.

o Just to test I then tried a boot with no dock and without turning off DMC firmware (i.e., vanilla 4.18.5). This resulted in a crash.

So, it seems like the DMC firmware suggestion does make a real difference in the absence of a dock/external monitor. Journalctl shows no strange behavior before the dock'ed crashes (just like before), but over the next day or so I'll try it again with either Kdump or PSTORE EFI to capture something. Previous experience has not made me hopeful in this regard.

Thanks,
Arthur

Comment 44 Arthur 2018-09-04 16:32:51 UTC

Ok, it looks like I spoke too soon. While trying to see if I could get a crash report I found out that I still get the sudden crashes with the DMC firmware unloaded and the computer unplugged from any dock or monitor. In other words, the DMC firmware fix does NOT appear to have fixed the issue. Sorry for the previous misleading information.

As before there are no logs or anything saved to PSTORE during these crashes.

Comment 45 Filip 2018-09-09 14:42:24 UTC

(In reply to Jani Nikula from comment #41)
> Please try kernel v4.18 or later with i915.dmc_firmware_path="" (or
> something that doesn't exist) to disable DMC firmware and consequently
> runtime PM.

Thanks for looking into this and for the new suggestion. 

Unfortunately I still do experience freezing. Tested with kernel 4.18.6 and checked that DMC is off the same way Arthur did.

Comment 46 Arthur 2018-09-14 16:55:37 UTC

Until the underlying issue is figured out/resolved I've made a patch to re-enable this parameter. I've been using it with kernel 4.18.7 for a few days with no problems and no crashes!

Check out
https://github.com/eigenbrot/re-enable_rc6

It's basically just the output of git revert on commit fb6db0f5bf1d4d3a4af6242e287fa795221ec5b8 of the linux kernel, but I did do a little polishing by hand to hopefully keep all of the improvements made to the i915 driver since the parameter was removed.

Comment 47 Timo Teräs 2019-01-02 06:15:31 UTC

I started getting same kind of freezes on my Dell Latitude 7390 (Intel Corporation UHD Graphics 620, rev 07) after upgrading kernel 4.14.89 to 4.19.13. So this may or may not be the same issue.

I can reliably reproduce by setting display to power off and waiting for about a minute. Powering display on immediately after power off seems to work. After freeze nothing works.

I am available to try to get logs or debug the issue. Any new ideas what to look at?

Comment 48 Lakshmi 2019-04-15 16:52:54 UTC

Anshuman, any help on this bug would be appreciated.

Comment 49 Anshuman Gupta 2019-07-16 04:38:58 UTC

Hi Timo,
Could you please provide dmesg logs (boot dmesg and dmesg with the freeze issue) with drm.debug=0xe?

Thanks,
Anshuman

Comment 50 Timo Teräs 2019-07-16 06:12:30 UTC

Created attachment 144792
various dmesg w/ drm.debug=0xe on dell laptop

Find attached the dmesg of boot up, successful power down/up and finally from the crash. After several attempts I managed to get two WARN_ON() back traces on it, so it might actually be very useful.

The power off is done by XFCE Power Manager: Settings > Power Manager > Display tab > Display power management (enabled, with switch off at 1 minute). I then wait for 1 minute for power off. Power on is attempted by typing on keyboard.

It seems that sometimes the crash happens pretty soon after powering display off (e.g. system crash can happen earlier than 60 seconds). It seems to be total hang as I had a little loop copying uptime to a file and running 'sync'. And that stops usually around the time display wake up is attempted.
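
For anyone who wants to repeat the "copy uptime to a file and sync" check, here is a minimal sketch in Python. It is not the loop I actually ran; the function names, the log file name and the one-second interval are all assumptions made for illustration, and it calls os.fsync() on the log file instead of running a global 'sync'.

```python
import os
import time


def write_heartbeat(log_path, uptime_path="/proc/uptime"):
    """Append the current uptime to log_path and force it onto disk."""
    with open(uptime_path) as src:
        uptime = src.read().split()[0]  # first field is seconds since boot
    with open(log_path, "a") as out:
        out.write(uptime + "\n")
        out.flush()
        os.fsync(out.fileno())  # push this file's data to disk before a possible hang
    return uptime


def run(log_path="heartbeat.log", interval=1.0, count=None, uptime_path="/proc/uptime"):
    """Write a heartbeat every `interval` seconds; count=None loops until interrupted."""
    written = 0
    while count is None or written < count:
        write_heartbeat(log_path, uptime_path)
        written += 1
        time.sleep(interval)
```

Calling run() before powering the display off and reading the last line of heartbeat.log after the forced reboot gives a rough estimate of when the system stopped making progress.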

Comment 51 Imre Deak 2019-07-16 15:44:23 UTC

(In reply to Timo Teräs from comment #50)
> Created attachment 144792 [details]
> various dmesg w/ drm.debug=0xe on dell laptop
> 
> Find attached the dmesg of boot up, successful power down/up and finally
> from the crash. After several attempts I managed to get two WARN_ON() back
> traces on it, so it might actually be very useful.
> 
> The power off is done by XFCE Power Manager: Settings > Power Manager >
> Display tab > Display power management (enabled, with switch off at 1
> minutes). I then wait for 1 minute for power off. Power on is attempted by
> typing on keyboard.
> 
> It seems that sometimes the crash happens pretty soon after powering display
> off (e.g. system crash can happen earlier than 60 seconds). It seems to be
> total hang as I had a little loop copying uptime to a file and running
> 'sync'. And that stops usually around the time display wake up is attempted.

Could you check whether you can still reproduce the same problem when booting with i915.enable_dc=0?

Comment 52 Timo Teräs 2019-07-16 17:20:54 UTC

(In reply to Imre Deak from comment #51)
> Could you try if you can still reproduce the same problem when booting with
> i915.enable_dc=0 ?

Seems that this improves things significantly for me. No crash so far. I will keep this setting for now and see whether a crash happens over a longer period. Though earlier comments (1-15) do mention that it's not a full solution for everyone.

Comment 53 George McCollister 2019-07-22 14:54:34 UTC

I'm able to reproduce this issue on five Atom E3845 based embedded systems in a lab.
With intel_idle.max_cstate=0 processor.max_cstate=0, I can get all systems to restart overnight due to a watchdog reset.

Sometimes, but not always, I have observed errors such as these on the serial console immediately prior to the system lockup/reboot:
[144039.363431] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 00000c00 (000ffc00)
[144039.476432] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 000000c0 (000ffcc0)
[144039.589165] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 00000000 (000fccc0)
[144039.702190] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 00000000 (000f0cc0)
[144039.814669] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 00000000 (000c0cc0)
[144039.925749] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 00000000 (00000cc0)
[144040.120485] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 00000c00 (00000cc0)
[144040.233084] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 000000c0 (00000cc0)
[144040.388315] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 000000c0 (00000cc0)
[144040.607674] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 000000c0 (00000cc0)

Most commonly, only one of the above error messages (or none) is printed to the serial console.
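
When collecting serial console logs from several systems, the lines above can be picked out mechanically. The following is only a sketch: the regular expression is written against the exact message format shown above, and the function name is made up for illustration. It returns the timestamp and the two hexadecimal values as integers without interpreting what the values mean.

```python
import re

# Matches lines of the form shown above, e.g.
# [144039.363431] [drm:vlv_set_power_well [i915]] *ERROR* timeout setting power well state 00000c00 (000ffc00)
POWER_WELL_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm:vlv_set_power_well \[i915\]\] \*ERROR\* "
    r"timeout setting power well state (?P<state>[0-9a-f]{8}) \((?P<paren>[0-9a-f]{8})\)"
)


def parse_power_well_errors(text):
    """Return (timestamp, state, parenthesised value) for each matching log line."""
    events = []
    for line in text.splitlines():
        match = POWER_WELL_RE.search(line)
        if match:
            events.append(
                (float(match["ts"]), int(match["state"], 16), int(match["paren"], 16))
            )
    return events
```

Counting the returned events per boot would show how often a lockup is preceded by one, several, or none of these messages.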

On 4.14.x, adding i915.enable_rc6=0 allows the systems to run 3+ days (until I stop the test).

Hoping it would fix the problem, I checked out and built kernel commit a75d035fedbdecf83f86767aa2e4d05c8c4ffd95. All systems still rebooted overnight.

I've since found that using i915.disable_power_well=0 also prevents the problem from occurring on all tested kernel versions. Is this setting less disruptive to the operation than i915.enable_rc6=0? Is there also value in testing "enable_dc=0"?

If any of the developers are working on this and think they have a fix, give me the URI of a git repo and the commit to use, and I can build it and test it in the lab. Also specify any kernel config settings and kernel command line arguments you want me to use.

Since someone might ask: I'm using "intel_idle.max_cstate=0 processor.max_cstate=0" because these systems require minimal scheduling latency. I also noticed they can prevent other i915 lockup issues. I can remove them for testing purposes upon request.
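
With this many boot parameters in play, it helps to confirm that each test system really booted with the intended ones. The sketch below assumes the command line has been read as a string (on Linux it is available in /proc/cmdline); the helper names and the BOOT_IMAGE/root values in the example are hypothetical, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline, expected):
    """Return the expected parameters that are absent or set to a different value."""
    actual = parse_cmdline(cmdline)
    return {k: v for k, v in expected.items() if actual.get(k) != v}


# Hypothetical command line for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet intel_idle.max_cstate=0 processor.max_cstate=0"
wanted = {
    "intel_idle.max_cstate": "0",
    "processor.max_cstate": "0",
    "i915.disable_power_well": "0",
}
print(missing_params(example, wanted))
```

On the example string this reports i915.disable_power_well as the only parameter that is not set as wanted.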

Comment 54 Arthur 2019-08-19 16:35:33 UTC

Thanks for the comments, Timo and George. I tested some of your solutions and am still experiencing crashes on 5.2.8 with any combination of i915.disable_power_well=0, intel_idle.max_cstate=0, processor.max_cstate=0, or i915.enable_dc=0.
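
To make "any combination" concrete, the non-empty subsets of these four parameters can be enumerated as below. This is only a sketch of how such a test matrix could be generated; the function name is made up, and it does not record which combinations were actually booted.

```python
from itertools import combinations

PARAMS = [
    "i915.disable_power_well=0",
    "intel_idle.max_cstate=0",
    "processor.max_cstate=0",
    "i915.enable_dc=0",
]


def param_combinations(params):
    """Yield every non-empty subset of params as a kernel command line fragment."""
    for size in range(1, len(params) + 1):
        for combo in combinations(params, size):
            yield " ".join(combo)


for fragment in param_combinations(PARAMS):
    print(fragment)
```

For four parameters this produces 15 fragments, from each parameter alone up to all four together.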

Comment 55 Lakshmi 2019-08-27 10:59:28 UTC

Anshuman, what are the next steps? Given its high priority, this issue needs an update at least once a week.

Comment 56 Martin Peres 2019-11-29 17:44:49 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/100.
