Bug 88012 - [bisected BYT] complete freeze after: drm/i915/vlv: WA for Turbo and RC6 to work together
Status: CLOSED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: highest major
Assignee: Kimmo Nikkanen
QA Contact: Intel GFX Bugs mailing list
URL: https://bugzilla.kernel.org/show_bug....
Whiteboard:
Keywords:
Duplicates: 93214
Depends on:
Blocks:
 
Reported: 2015-01-04 10:07 UTC by Peter Frühberger
Modified: 2017-07-24 22:49 UTC
CC List: 40 users

See Also:
i915 platform: BYT
i915 features: GPU hang


Attachments
dmesg output from boot till crash (drm.debug=0xe debug ignore_loglevel) (1.54 MB, text/plain)
2015-01-04 14:34 UTC, René Lange
Be more careful with punit reads (2.65 KB, patch)
2015-01-04 19:39 UTC, Ben Widawsky
111723: dmesg output from boot to hung (drm.debug=0xe debug ignore_loglevel) (190.23 KB, text/plain)
2015-01-04 23:13 UTC, Juergen Froehler
dmesg output from boot to hung (drm.debug=0xe debug ignore_loglevel) (153.91 KB, text/plain)
2015-01-05 13:01 UTC, Juergen Froehler
ickle1 - dmesg output from boot to hung (drm.debug=0xe debug ignore_loglevel) (110.90 KB, text/plain)
2015-01-06 00:59 UTC, Juergen Froehler
ickle1 - dmesg output with trace on the end (drm.debug=0xe debug ignore_loglevel) (57.70 KB, text/plain)
2015-01-06 09:13 UTC, Juergen Froehler
dmesg with i915.enable_rc6=0 (3.19.0-rc2-ickle1+) (178.45 KB, text/plain)
2015-01-06 10:24 UTC, Juergen Froehler
no freeze - dmesg with intel_pstate=disable (3.19.0-rc2-ickle1+) (229.05 KB, text/plain)
2015-01-06 11:33 UTC, Juergen Froehler
no freeze - dmesg with governor=powersave (3.19.0-rc2-ickle1+) (344.27 KB, text/plain)
2015-01-07 22:03 UTC, Juergen Froehler
Crashlog GPU Hang on drm-intel-nightly 4.0.0 (1.15 MB, text/plain)
2015-04-28 22:55 UTC, Maxime Bergeron
Dmesg - drm-intel-nightly 4.0.0 (3.33 KB, text/plain)
2015-04-28 22:55 UTC, Maxime Bergeron
drm/i915/vlv: Take forcewake on media engine writes (2.57 KB, patch)
2015-12-17 15:15 UTC, Mika Kuoppala
drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes (2.01 KB, patch)
2015-12-18 13:04 UTC, Mika Kuoppala

Description Peter Frühberger 2015-01-04 10:07:13 UTC
We experienced strange full system freezes on Asrock Q1900 hardware with our OpenELEC 5.0 release. No errors were visible via netconsole, the whole system just fully hung.

We then started to bisect between kernel 3.13 and 3.18 stable. It was verified before that 3.19-rc2 is also affected.

Commit: 31685c258e0b0ad6aa486c5ec001382cf8a64212 drm/i915/vlv: WA for Turbo and RC6 to work together 

was found to be the first bad commit in that bisect.

A manual workaround was to set the max C-state to C1 (via BIOS), which avoided this bug.

We currently have > 10 users that are affected by this bug (mostly Asrock Q1900 users).

You can see the complete bisecting steps here: https://github.com/OpenELEC/OpenELEC.tv/issues/3726#issuecomment-68626603

I will ask that user to subscribe to this tracker. As we freeze very hard, it's not possible to add logfiles as the netconsole stays empty for us.
Comment 1 René Lange 2015-01-04 14:34:07 UTC
Created attachment 111723 [details]
dmesg output from boot till crash (drm.debug=0xe debug ignore_loglevel)

ASRock Q2900-ITX is affected, too.
The log was created using netconsole.
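
For anyone reproducing this kind of capture, the kernel command line can be assembled roughly as in the sketch below. It combines the debug flags named in the attachment title with a netconsole= parameter in the syntax described in the kernel's netconsole documentation. The helper name is hypothetical, and the ports, addresses, interface name and MAC in the usage note are placeholders, not values from this report.

```shell
# Hypothetical helper: print boot parameters for capturing i915 debug output
# over netconsole. Argument order follows the kernel's netconsole= syntax:
#   netconsole=<src-port>@<src-ip>/<dev>,<tgt-port>@<tgt-ip>/<tgt-mac>
netconsole_cmdline() {
    # $1 source port, $2 source IP, $3 network device,
    # $4 target port, $5 target IP, $6 target MAC (may be left out)
    printf 'drm.debug=0xe debug ignore_loglevel netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$1" "$2" "$3" "$4" "$5" "${6:-}"
}
```

Calling `netconsole_cmdline 6665 192.168.1.10 eth0 6666 192.168.1.20 00:11:22:33:44:55` prints one line that can be appended to the boot loader's kernel command line; the receiving machine then has to listen for UDP on the target port.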
Comment 2 Ben Widawsky 2015-01-04 19:39:54 UTC
Created attachment 111734 [details] [review]
Be more careful with punit reads

It's a bit of a long shot, but let's see what happens. 
I have only compile tested this patch.
Comment 3 Juergen Froehler 2015-01-04 23:13:22 UTC
Created attachment 111739 [details]
111723: dmesg output from boot to hung (drm.debug=0xe debug ignore_loglevel)

Good day, I did the bisect, see attached my dmesg. 

System: Zotac CI320 Nano, FW Version 2K141128, Intel HD Graphics, Intel Celeron N2930 (quad-core, 1.83 GHz)
Comment 4 Chris Wilson 2015-01-05 10:34:22 UTC
I had some patches to improve the vlv rps: http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug88012

They incorporated the change Ben suggested and reduce the number of interrupts required by the manual RPS tuning, as well as making it much more responsive to gfx workload (not that byt has that great a range). It doesn't explain a system hang though...
Comment 5 Juergen Froehler 2015-01-05 13:01:24 UTC
Created attachment 111767 [details]
dmesg output from boot to hung (drm.debug=0xe debug ignore_loglevel)

I built the kernel (3.18.1-bw1+) from Peter's git with Ben Widawsky's experimental patch. Unfortunately I had the freeze again after ~10 minutes of playing a movie. Attached is the dmesg log via netconsole up to the system freeze.
If you need more information or logs, I will of course support as much as possible.
Comment 6 Peter Frühberger 2015-01-05 18:53:58 UTC
@Juergen Froehler:

Please give ickle's branch a try, I forked it on my github (as freedesktop's git was really slow in the past):

git clone https://github.com/fritsch/linux.git
git checkout bug88012
make localmodconfig
make-kpkg --append-to-version "-ickle1" --initrd linux-headers linux-image

And give it a good test.

Btw. As your base OS is Ubuntu 14.04, you might need to upgrade the linux-firmware (or ignore warnings about it a bit).
Comment 7 Juergen Froehler 2015-01-06 00:59:56 UTC
Created attachment 111800 [details]
ickle1 - dmesg output from boot to hung (drm.debug=0xe debug ignore_loglevel)

Hello,
first I updated the Ubuntu firmware to linux-firmware_1.140 and built the new kernel based on ~ickle (3.19.0-rc2-ickle1+). The system hung after 5 minutes of runtime. The logfile was created via netconsole, from boot to freeze.
Comment 8 Chris Wilson 2015-01-06 09:07:15 UTC
Have you tried i915.enable_rc6=0? Or maybe using intel_pstate?
Comment 9 Juergen Froehler 2015-01-06 09:13:26 UTC
Created attachment 111836 [details]
ickle1 - dmesg output with trace on the end (drm.debug=0xe debug ignore_loglevel)

This morning I got this nice one, but it happened before I played any movie. I should add that I wanted to do a reference test (to see if the workaround still works) and limited the C-state to C3 in the BIOS.

Kernel: 3.19.0-rc2-ickle1+ (the one I build last night from ickle git)
Comment 10 Juergen Froehler 2015-01-06 09:33:48 UTC
(In reply to Chris Wilson from comment #8)
> Have you tried i915.enable_rc6=0? Or maybe using intel_pstate?

Not yet, but I will test it now and give feedback soon.
Comment 11 Juergen Froehler 2015-01-06 10:24:24 UTC
Created attachment 111842 [details]
dmesg with i915.enable_rc6=0 (3.19.0-rc2-ickle1+)

Ok, here the result of the first test with i915.enable_rc6=0
I checked twice to be sure it was disabled
once in the dmesg:
[    2.626518] [drm] RC6 disabled, disabling runtime PM support
[    3.799990] [drm:intel_print_rc6_info] Enabling RC6 states: RC6 off

and once in the parameters:
/sys/module/i915/parameters/enable_rc6=0

It looks the same as when I limit the C-state to C3 in the BIOS. There is a trace at the end of the attached logfile, and if I play an mkv it freezes after some minutes.

The next test will be with intel_pstate=disable.
Currently the settings look like this:
for i in /sys/devices/system/cpu/intel_pstate/*; do echo $i=$(cat $i); done
/sys/devices/system/cpu/intel_pstate/max_perf_pct=100
/sys/devices/system/cpu/intel_pstate/min_perf_pct=100
/sys/devices/system/cpu/intel_pstate/no_turbo=0
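
The loop above generalises to any sysfs directory, which may be handy for recording settings before each test run. The following is a minimal sketch; the function name is hypothetical and the default path is simply the intel_pstate directory shown in this comment.

```shell
# Hypothetical helper: print "path=value" for every readable regular file
# in a directory, mirroring the one-line loop used in this comment.
dump_sysfs_dir() (
    dir=${1:-/sys/devices/system/cpu/intel_pstate}
    for f in "$dir"/*; do
        # Skip subdirectories, unreadable entries and an unmatched glob.
        [ -f "$f" ] && [ -r "$f" ] || continue
        printf '%s=%s\n' "$f" "$(cat "$f")"
    done
)
```

For example, `dump_sysfs_dir /sys/module/i915/parameters` would list the i915 module parameters, including enable_rc6, in the same path=value form.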
Comment 12 Juergen Froehler 2015-01-06 11:33:34 UTC
Created attachment 111844 [details]
no freeze - dmesg with intel_pstate=disable  (3.19.0-rc2-ickle1+)

This time I did a test with intel_pstate=disable. I had no freeze during a 45 minute run of a file which usually freezes. Anyway, I attached the dmesg in case you would like to verify.
Comment 13 Ben Widawsky 2015-01-07 07:54:14 UTC
Do other governors also cause a hang? For instance:

i=0; for g in /sys/devices/system/cpu/cpu[0-9]/cpufreq/scaling_governor; do   echo powersave > $g;   echo cpu$i: $(cat $g);   ((i++)); done
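
A slightly more defensive variant of the same idea is sketched below. The function name and the optional second argument (an alternative sysfs root, useful for dry runs) are assumptions for illustration; the real path is the one used in the loop above. It derives the CPU name from the path instead of counting, and reads the value back so a rejected governor is visible.

```shell
# Hypothetical helper: write a cpufreq governor to every CPU and echo back
# the value that was actually stored.
set_governor() (
    governor=$1
    root=${2:-/sys/devices/system/cpu}
    for g in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without a writable cpufreq node (and an unmatched glob).
        [ -w "$g" ] || continue
        printf '%s\n' "$governor" > "$g"
        cpu=${g#"$root"/}          # cpuN/cpufreq/scaling_governor
        printf '%s: %s\n' "${cpu%%/*}" "$(cat "$g")"
    done
)
```

Run as root, `set_governor powersave` should print one "cpuN: powersave" line per core if the governor was accepted.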
Comment 14 Juergen Froehler 2015-01-07 09:19:10 UTC
(In reply to Ben Widawsky from comment #13)
> Do other governors also cause a hang? For instance:
> 
> for g in /sys/devices/system/cpu/cpu[0-9]/cpufreq/scaling_governor; do  
> echo powersave > $g;   echo cpu$i: $(cat $g);   ((i++)); done

Hello Ben, I will do the test tonight and then give feedback.
Comment 15 Juergen Froehler 2015-01-07 22:03:09 UTC
Created attachment 111932 [details]
no freeze - dmesg with governor=powersave (3.19.0-rc2-ickle1+)

Hello Ben,
Here the result of my test with governor=powersave. 
first I set the governor to powersave and checked it after reboot:

for g in /sys/devices/system/cpu/cpu[0-9]/cpufreq/scaling_governor; do echo $g=$(cat $g); done
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor=powersave
/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor=powersave
/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor=powersave
/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor=powersave

Booted Kernel regular (without i915.enable_rc6=0 and without intel_pstate=disable)

Kernel build from ickle git: 3.19.0-rc2-ickle1+ #1 SMP Tue Jan 6 00:57:18 CET 2015 x86_64 x86_64 x86_64 GNU/Linux

I have now played some files for almost 2 hours without a freeze. In the attached logfile there is just one kernel trace (perhaps interesting for ickle), but it seems to have had no impact during the test. During the run the CPU mostly looked like this:
cat /proc/cpuinfo | grep "cpu MHz"
cpu MHz         : 499.741
cpu MHz         : 499.741
cpu MHz         : 499.741
cpu MHz         : 499.741

I will now do another test with the "regular" kernel 3.17.7 and governor=powersave, just to see whether it freezes or also runs more "stable".
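
To log the clock speeds during a long run without the surrounding text, the grep above can be reduced to the bare numbers. This is a small sketch; the function name and the optional file argument are hypothetical, and the values are only a snapshot at the moment of reading.

```shell
# Hypothetical helper: print one "cpu MHz" value per logical CPU.
# Reads /proc/cpuinfo by default; a file path can be given instead.
cpu_mhz() {
    awk -F': *' '/^cpu MHz/ { print $2 }' "${1:-/proc/cpuinfo}"
}
```

Something like `while sleep 10; do cpu_mhz | tr '\n' ' '; echo; done` would then give one line of frequencies every ten seconds.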
Comment 16 Ben Widawsky 2015-01-07 22:51:39 UTC
(In reply to Juergen Froehler from comment #15)

> Well I will do another test now with the "regular" Kernel 3.17.7 and
> governor=powersave just to see if it freeze or also run more "stable".

Can you also confirm you are unable to hit this without GPU, and just CPU stress tests (I do not have any recommendations for which test)? I see a similar sounding problem, but it is very intermittent for me.
Comment 17 Juergen Froehler 2015-01-07 23:40:44 UTC
(In reply to Ben Widawsky from comment #16)
> (In reply to Juergen Froehler from comment #15)
> 
> > Well I will do another test now with the "regular" Kernel 3.17.7 and
> > governor=powersave just to see if it freeze or also run more "stable".
> 
> Can you also confirm you are unable to hit this without GPU, and just CPU
> stress tests (I do not have any recommendations for which test)? I see a
> similar sounding problem, but it is very intermittent for me.

What I can say, and what I have tested most intensively on the generic kernels (3.13.0 > 3.17.7) and also on the ickle kernel 3.19-rc2, is disabling HW acceleration (VAAPI) in Kodi and playing movies for several hours without a freeze/hang. The hang happens only when HW acceleration in Kodi is enabled.

In the meantime I have been running 3.17.7-generic with governor=powersave for over an hour without a freeze. The logfile fills up quickly with the aggressive drm:valleyview_set_rps messages, but no freeze yet... I will let it run some more.

Some days ago I also did a long memtest run, 6 passes with 0 errors over ~8 hours, to be sure there is no HW issue.

As for a CPU stress test, I will look around for something I can use without killing the CPU.
Comment 18 DDD 2015-01-08 05:19:11 UTC
Maybe Prime95 can be used for CPU stress tests?
http://www.mersenne.org/download/#stresstest
Comment 19 Juergen Froehler 2015-01-08 11:09:14 UTC
(In reply to DDD from comment #18)
> Maybe Prime95 can be used for CPU stress tests?
> http://www.mersenne.org/download/#stresstest

I plan to run the CPU stress test tonight and will give feedback. I found several good pointers for stress testing in the Ubuntu Wiki.
Comment 20 Juergen Froehler 2015-01-08 17:32:10 UTC
CPU stress test: I used stress with a runtime of 1200 seconds, which should be a decent CPU burn test if our intent is to figure out whether there is a CPU issue. Under normal circumstances, running Kodi 14 on this box, I have never seen CPU usage this high for this long.

---
 stress --cpu 4 --timeout 1200
stress: info: [4387] dispatching hogs: 4 cpu, 0 io, 0 vm, 0 hdd
stress: info: [4387] successful run completed in 1200s


CPU > max speed (switched off Turbo mode in Bios to avoid killing my CPU)
cat /proc/cpuinfo | grep "cpu MHz"
cpu MHz         : 1832.600
cpu MHz         : 1832.600
cpu MHz         : 1832.600
cpu MHz         : 1832.600

top during test run
%Cpu(s):100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
4388 root      20   0    7316    100      0 R 100.0  0.0   19:49.05 stress
4389 root      20   0    7316    100      0 R 100.0  0.0   19:48.16 stress
4390 root      20   0    7316    100      0 R 100.0  0.0   19:52.30 stress
4391 root      20   0    7316    100      0 R  97.4  0.0   19:48.88 stress

sensors
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +58.0°C  (high = +105.0°C, crit = +105.0°C)
Core 1:       +58.0°C  (high = +105.0°C, crit = +105.0°C)
Core 2:       +62.0°C  (high = +105.0°C, crit = +105.0°C)
Core 3:       +61.0°C  (high = +105.0°C, crit = +105.0°C)
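
To compare several stress runs, the hottest core can be pulled out of the sensors output with a short filter. This is only a sketch that assumes the "Core N: +TT.T°C" line layout shown above; the function name is hypothetical.

```shell
# Hypothetical helper: read `sensors` output on stdin and print the highest
# per-core temperature (prints 0 if no "Core N:" line is found).
max_core_temp() {
    awk '/^Core [0-9]+:/ {
             t = $3                    # e.g. "+58.0°C"
             gsub(/[^0-9.]/, "", t)    # keep digits and the decimal point
             if (t + 0 > max) max = t + 0
         }
         END { print max + 0 }'
}
```

It would be used as `sensors | max_core_temp`, for example once a minute while stress is running.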
Comment 21 Juergen Froehler 2015-01-12 07:59:30 UTC
good day together,

over the weekend I had some time to do several tests and I like to share my findings.

1. I ran several different CPU and memory stress tests and all went fine, so I think the CPU and memory themselves are fine.

2. The kernels tested between 3.13 and 3.16 run stable, no freeze.

3. I tested several mainline generic kernels between 3.17 and 3.19-rc2, plus the ickle 3.19-rc2, with governor=powersave and C6/7 enabled in the BIOS.

The findings: it runs more stably, and freezes happen only sporadically. I was not able to figure out under which circumstances they happen. Sometimes the test files run for over 2 hours without a freeze, sometimes there are 4 freezes in 1 hour. The logfiles give no hint about the freeze.

I am very sorry that I was not able to get more logfile information out of the system, but if you have more test scenarios, I am glad to support.
Comment 22 Juergen Froehler 2015-02-04 16:34:08 UTC
good day together,

I kindly ask all, if there is something we can do to push this Topic a bit forward. As I already wrote I will support as far as possible.

kind regards
Comment 23 Jani Nikula 2015-02-05 10:43:37 UTC
So the regressing commit is

commit 31685c258e0b0ad6aa486c5ec001382cf8a64212
Author: Deepak S <deepak.s@linux.intel.com>
Date:   Thu Jul 3 17:33:01 2014 -0400

    drm/i915/vlv: WA for Turbo and RC6 to work together.

Deepak, Ville, do you have any ideas?
Comment 24 Chris Wilson 2015-02-05 10:48:58 UTC
That bisect appears to be a red herring though.
Comment 25 Andy Furniss 2015-02-05 11:22:47 UTC
(In reply to Chris Wilson from comment #24)
> That bisect appears to be a red herring though.

As someone who is also affected by this I can well believe that.

Though I wasn't bisecting the kernel at the time (and still haven't), I've had runs of 12 hours without a lock; the same setup locked in < 2 hours the next day.

Having tested quite hard with released kernels since this bug was filed, I am 99.9% sure the issue was introduced between 3.16.7 and 3.17.0.

I also think that GPU load is needed: I use LFS and have compiled plenty on bad kernels, plus run the mprime torture test, and never locked.

Do you have any guesses as to what the bad commit between those could be? If you have, people could test it; someone will hopefully call it bad quickly and then do an extended test on the commit before.
Comment 26 Juergen Froehler 2015-02-05 23:03:35 UTC
(In reply to Chris Wilson from comment #24)
> That bisect appears to be a red herring though.

Well, to be honest, yes, it could be a red herring, because deciding whether a bisect step was good or bad wasn't easy. I tested each step for at least 2 hours, though mostly the freeze came much earlier. However, what I can say with 100% certainty is that I have been running my device with Ubuntu 14.04.1 and kernel 3.16.7-031607-generic for 20 days now as my daily beast without any freeze/hang, with VAAPI HW acceleration enabled in Kodi and the C6/7 idle state enabled in the BIOS. Therefore I believe it is not a hardware issue. With a 3.17.x kernel, no way... it starts freezing in under 1 hour with the same settings.

If someone has a suggestion or idea for how we can narrow down this issue, I am glad to support and test.
Comment 27 René Lange 2015-02-06 05:52:20 UTC
(In reply to Chris Wilson from comment #24)
> That bisect appears to be a red herring though.

With the help of Peter, I built 2 kernels about 30 days ago.
The first one with git reset --hard 31685c258e0b0ad6aa486c5ec001382cf8a64212
The second one with a subsequent git revert 31685c258e0b0ad6aa486c5ec001382cf8a64212

The first one crashed every time I tested it; the second one ran fine for a few hours and didn't crash at all. If you want, I can run the second one as my standard kernel to be more sure that this commit is the right one.
Comment 28 Juergen Froehler 2015-02-10 15:59:54 UTC
I just tested the latest mainline kernel 3.19.0 to confirm that the freeze still exists. Unfortunately this issue has now existed since 3.17.x.

kind regards Juergen
Comment 29 wappy 2015-02-14 15:15:32 UTC
It's not only the Q1900/J1900; the J1800 has the same problem from time to time: it just hangs without any errors.
Comment 30 sal coedmen 2015-03-05 17:28:37 UTC
same shit j1900  shit intel never again
Comment 31 Jesse Barnes 2015-03-05 18:15:23 UTC
Juergen sounds certain that this commit affects this issue, and I can believe it.

The punit provides several services, including CPU and GPU power management, and the code in question changes how we interact with the Punit to a degree.

So it's possible a BIOS upgrade (which would include a new Punit firmware) might help.

It's also possible that we're not validating the result of Deepak's code enough and end up feeding some bad values to the Punit as a result of the new calculations.

Or the simple fact that we're reading a new Punit reg fairly frequently is enough to cause trouble.  In that case, throttling the vlv_c0_residency reads of the CZ timestamp may be enough to avoid this.  (I don't think the C0 count reads should cause trouble, but it's possible they trigger additional punit activity as well, just by being enabled for read out in the control reg.)

Deepak, Ben, or Chris, any other ideas?
Comment 32 Andy Furniss 2015-03-05 20:30:51 UTC
I don't know about others, but on my Asrock Q1900DC-ITX I installed the latest BIOS (1.20) as soon as I got it; there is nothing newer as of today.

I hadn't tried a new kernel since mid-January (a nightly) but did today, and today's nightly and fixes don't boot, giving

ahci failed to stop engine then oops.

I haven't had time to see when it changed. I can boot with pci=nocrs.
Comment 33 Deepak S 2015-03-06 03:31:48 UTC
Hi Jesse,

I suspect the voltage change after a GPU frequency request.

Can we try the options below?
1. Keep the frequency at min (RPn) & run the workload. This will ensure we run at constant GPU voltage.
a) cat /sys/class/drm/card0/gt_RPn_freq_mhz
b) echo "value from above cmd" >/sys/class/drm/card0/gt_max_freq_mhz

2) Switch back to legacy turbo.
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 9baecb7..0dac413 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -4292,12 +4292,7 @@ void intel_irq_init(struct drm_i915_private *dev_priv)
        INIT_WORK(&dev_priv->rps.work, gen6_pm_rps_work);
        INIT_WORK(&dev_priv->l3_parity.error_work, ivybridge_parity_work);
 
-       /* Let's track the enabled rps events */
-       if (IS_VALLEYVIEW(dev_priv) && !IS_CHERRYVIEW(dev_priv))
-               /* WaGsvRC0ResidencyMethod:vlv */
-               dev_priv->pm_rps_events = GEN6_PM_RP_UP_EI_EXPIRED;
-       else
-               dev_priv->pm_rps_events = GEN6_PM_RPS_EVENTS;
+       dev_priv->pm_rps_events = GEN6_PM_RPS_EVENTS;
 
        INIT_DELAYED_WORK(&dev_priv->gpu_error.hangcheck_work,
                          i915_hangcheck_elapsed);
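
Option 1 above can be done in one step, as in the sketch below. The function name and the optional directory argument are hypothetical; the two sysfs files are the ones named in this comment. It reads the value back after writing so that a rejected write does not go unnoticed.

```shell
# Hypothetical helper: cap the GPU at its minimum (RPn) frequency by copying
# gt_RPn_freq_mhz into gt_max_freq_mhz, then print the resulting maximum.
pin_gpu_to_rpn() (
    card=${1:-/sys/class/drm/card0}
    rpn=$(cat "$card/gt_RPn_freq_mhz") || exit 1
    printf '%s\n' "$rpn" > "$card/gt_max_freq_mhz" || exit 1
    cat "$card/gt_max_freq_mhz"
)
```

It needs root on a real system. Note the previous gt_max_freq_mhz value first, so that it can be written back after the test.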
Comment 34 Andy Furniss 2015-03-06 10:32:08 UTC
(In reply to Deepak S from comment #33)
> Hi Jesse,
> 
> I suspect the voltage change after a GPU frequency request.
> 
> Can we try the options below?
> 1. Keep the frequency at min (RPn) & run the workload. This will ensure we
> run at constant GPU voltage.
> a) cat /sys/class/drm/card0/gt_RPn_freq_mhz
> b) echo "value from above cmd" >/sys/class/drm/card0/gt_max_freq_mhz

This alone does not fix for me - if anything it locked sooner, but then I only did 2 runs.

Will try patch alone soon.
Comment 35 Andy Furniss 2015-03-06 11:26:19 UTC
(In reply to Andy Furniss from comment #32)

> ahci failed to stop engine then oops.
> 
> Haven't had time to see when it changed. Can boot with pci=nocrs.

In case anyone else testing on ASrock Qxxxx hits this, I bisected and there is already a bug filed -

https://bugzilla.kernel.org/show_bug.cgi?id=94221
Comment 36 Juergen Froehler 2015-03-06 16:17:14 UTC
(In reply to Deepak S from comment #33)
> Hi Jesse,
> 
> I suspect the voltage change after a GPU frequency request.
> 
> Can we try the options below?
> 1. Keep the frequency at min (RPn) & run the workload. This will ensure we
> run at constant GPU voltage.
> a) cat /sys/class/drm/card0/gt_RPn_freq_mhz
> b) echo "value from above cmd" >/sys/class/drm/card0/gt_max_freq_mhz

Hi together,

I tested option 1, but the system still freezes and there is no chance of getting any relevant output in the log.

~# cat /sys/class/drm/card0/gt_RPn_freq_mhz
167
~# cat /sys/class/drm/card0/gt_max_freq_mhz
854
~# echo "167" >/sys/class/drm/card0/gt_max_freq_mhz
~# cat /sys/class/drm/card0/gt_max_freq_mhz
167

kind regards
Juergen
Comment 37 Peter Frühberger 2015-03-06 16:41:46 UTC
Here is an OpenELEC build with option 2) integrated: https://dl.dropboxusercontent.com/u/55728161/OpenELEC-Generic.x86_64-devel-20150306172724-r20368-gb822824.tar

Kernel 3.19 is used
Comment 38 Andy Furniss 2015-03-06 23:00:19 UTC
(In reply to Deepak S from comment #33)

> Can we try below options.

> 2) Switch back to legacy turbo. 

2 is good for me so far, been running almost 12 hrs.
Comment 39 Juergen Froehler 2015-03-06 23:43:23 UTC
(In reply to Peter Frühberger from comment #37)
> Here is an OpenELEC build with option 2) integrated:
> https://dl.dropboxusercontent.com/u/55728161/OpenELEC-Generic.x86_64-devel-
> 20150306172724-r20368-gb822824.tar
> 
> Kernel 3.19 is used

This version has now been running for >2 hours without a freeze. I will let it run overnight and give feedback tomorrow.

kind regards
Juergen
Comment 40 Peter Frühberger 2015-03-07 08:02:45 UTC
@Deepak S:

What are the disadvantages for other Intel processors? Can we safely include this patch in our 3.17.x backports without introducing regressions for other non-BYT Intel hardware?
Comment 41 Juergen Froehler 2015-03-07 08:25:36 UTC
(In reply to Juergen Froehler from comment #39)
> (In reply to Peter Frühberger from comment #37)
> > Here is an OpenELEC build with option 2) integrated:
> > https://dl.dropboxusercontent.com/u/55728161/OpenELEC-Generic.x86_64-devel-
> > 20150306172724-r20368-gb822824.tar
> > 
> > Kernel 3.19 is used
> 
> This Version runs now >2 hour without an freeze. I let it run now over night
> and give feedback tomorrow.
> 
> kind regards
> Juergen

OK, it has now been running for over 9 hours continuously without a freeze.
@Deepak S - it looks like you hit the bull's eye
Comment 42 Deepak S 2015-03-07 08:41:17 UTC
@Peter Frühberger, the change is specific to BYT. It should not impact any other platform.

@Jesse, shall we enable legacy turbo on BYT until we have root-caused the BYT WA?
Also, Chris has submitted a cleaned-up patch for "WA for Turbo and RC6 to work together" for review.

Thanks
Deepak
Comment 43 Alex N 2015-03-14 20:32:43 UTC
(In reply to Juergen Froehler from comment #41)
> (In reply to Juergen Froehler from comment #39)
> > (In reply to Peter Frühberger from comment #37)
> > > Here is an OpenELEC build with option 2) integrated:
> > > https://dl.dropboxusercontent.com/u/55728161/OpenELEC-Generic.x86_64-devel-
> > > 20150306172724-r20368-gb822824.tar
> > > 
> > > Kernel 3.19 is used
> > 
> > This Version runs now >2 hour without an freeze. I let it run now over night
> > and give feedback tomorrow.
> > 
> > kind regards
> > Juergen
> 
> Ok it runs now over 9 hours continuously without a freeze
> @Deepak S - it looks like you hit the bull's eye

Unfortunately I had a freeze after about 4 hours :-(
Comment 44 Alex N 2015-03-14 20:46:49 UTC
(In reply to Alex N from comment #43)
> (In reply to Juergen Froehler from comment #41)
> > (In reply to Juergen Froehler from comment #39)
> > > (In reply to Peter Frühberger from comment #37)
> > > > Here is an OpenELEC build with option 2) integrated:
> > > > https://dl.dropboxusercontent.com/u/55728161/OpenELEC-Generic.x86_64-devel-
> > > > 20150306172724-r20368-gb822824.tar
> > > > 
> > > > Kernel 3.19 is used
> > > 
> > > This Version runs now >2 hour without an freeze. I let it run now over night
> > > and give feedback tomorrow.
> > > 
> > > kind regards
> > > Juergen
> > 
> > Ok it runs now over 9 hours continuously without a freeze
> > @Deepak S - it looks like you hit the bull's eye
> 
> Unfortunately I had a freeze after about 4 hours :-(

Sorry, just recognized, that my system hasn't been updated correctly!
Comment 45 Juergen Froehler 2015-03-15 06:17:43 UTC
OK, I have now used Peter's patched kernel 3.19.1-legacy-turbo+ for 1 week and had no freeze. Therefore I would say this works as an interim fix until the root cause is found.

If you have new findings or upcoming patches to test, I am glad to support as much as I can.

kind regards
Juergen
Comment 46 Daniel Vetter 2015-03-18 11:22:28 UTC
Deepak, can you pls submit a proper patch for option 2), maybe restricted to just vlv to intel-gfx? Hanging machines are a pretty serious regression, I'd like to see this resolved.

We'd need to make sure that this isn't an issue on chv ofc, but that can happen after the functional revert.
Comment 47 Deepak S 2015-03-18 14:52:53 UTC
@Daniel, I will submit the patch. Also, the WA is not enabled for CHV, so there should not be any problem.

Btw, Chris has cleaned-up patches for "WA for Turbo and RC6 to work"; should we try those?
Comment 48 Chris Wilson 2015-03-18 14:55:40 UTC
They are worth trying again afterwards. I don't think they avoid the fundamental issue here which appears to be the PCU itself.
Comment 49 Andy Furniss 2015-03-19 16:28:09 UTC
(In reply to Deepak S from comment #47)
> @Daniel, I will submit the patch & Also, WA not enabled for CHV so there
> should not be any problem.
> 
> Btw, Chris has cleaned up patches for "WA for Turbo and RC6 to work" should
> we try that?

I noticed some new patches went into a nightly (18th).

c4d390d drm/i915: Use down ei for manual Baytrail RPS calculations
168ebd7 drm/i915: Improved w/a for rps on Baytrail

It's a bit early to say anything conclusive, but I have so far not locked running that, but then I only did a few hours yesterday + currently up to 7 today.
Comment 50 Andy Furniss 2015-03-25 14:45:47 UTC
(In reply to Andy Furniss from comment #49)
> (In reply to Deepak S from comment #47)
> > @Daniel, I will submit the patch & Also, WA not enabled for CHV so there
> > should not be any problem.
> > 
> > Btw, Chris has cleaned up patches for "WA for Turbo and RC6 to work" should
> > we try that?
> 
> I noticed some new patches went into a nightly (18th).
> 
> c4d390d drm/i915: Use down ei for manual Baytrail RPS calculations
> 168ebd7 drm/i915: Improved w/a for rps on Baytrail
> 
> It's a bit early to say anything conclusive, but I have so far not locked
> running that, but then I only did a few hours yesterday + currently up to 7
> today.

I've done many hours of running since and I am still stable.
Comment 51 Jesse Barnes 2015-03-25 21:33:38 UTC
Ok, looks like we worked around this one then with the commits mentioned.  Thanks a lot for testing Juergen.
Comment 52 Juergen Froehler 2015-03-29 08:12:43 UTC
Thank you all for supporting
here is my personal summary after a long test period:

mainline kernels between 3.13 -> 3.16 do not have the freeze issue
with every mainline kernel between 3.17.x -> 3.19.2 the freeze appears fast & frequently
mainline kernel 3.19.3 (without legacy turbo fix): rare random freezes (I had just one in 4 days; still early to say more), fewer than before
patched kernel 3.19.x + legacy turbo fix: running rock solid, no freeze over a long time period

Therefore the kernel with the legacy turbo fix is, for me, the best result for daily usage at the moment.

I did not test any of the 4.x Kernels yet - if needed I will do.

kind regards
Juergen
Comment 53 Juergen Froehler 2015-03-31 06:32:02 UTC
A short update and feedback from my side; it may be worth knowing. I had time to run the latest mainline kernel 4.0.0-040000rc5.201503230035 during the last 2 days, and my finding is that the freeze still exists.

kind regards
Juergen
Comment 54 Andy Furniss 2015-03-31 07:34:28 UTC
(In reply to Juergen Froehler from comment #53)
> a short update & feedback from my side, perhaps it might be worth knowing. I
> had time to run the latest mainline Kernel 4.0.0-040000rc5.201503230035
> during the last 2 days and my findings are that the freeze still exist.

From what I can see the fixes above that I am still running aren't in drm-intel-fixes so I guess not anything mainline? They are in drm-intel-next-fixes.
Comment 55 Andy Furniss 2015-04-08 17:15:22 UTC
Today's nightly (2015-04-08) locks again.

I've been running nightly from 03-18 without issue till now - tested new kernel as I noticed that some more Baytrail changes went in eg.

Agressive downclocking on Baytrail

I'll try reverting it and running later.

FWIW when I hard lock the picture is always still on screen - just thought I'd mention it.
Comment 56 Andy Furniss 2015-04-08 18:34:48 UTC
(In reply to Andy Furniss from comment #55)
> Todays nightly 2015-04-08 locks again.

> Agressive downclocking on Baytrail
> 
> I'll try reverting it and running later.

Still locks with that reverted.
Comment 57 Juergen Froehler 2015-04-08 22:06:42 UTC
@ Andy
I still use heavily the patched 3.19.1 kernel from Fritsch as my daily beast without any freeze. 
And to confirm: same on my device. When the freeze happens with the unpatched kernels, the last picture stays visible; it looks just "frozen".
Comment 58 Andy Furniss 2015-04-08 22:56:48 UTC
(In reply to Juergen Froehler from comment #57)
> @ Andy
> I still use heavily the patched 3.19.1 kernel from Fritsch as my daily beast
> without any freeze. 
> And to confirm - same on my device when the freeze happens within the
> unpatched Kernels the last pictures is visible - it looks like just "frozen"

Yea, I was stable with the patch on here or with the nightly that didn't have the patch but did have the commits I mentioned above.

Something regressed - It seems trying to bisect the nightly tree isn't going to work - the first try was bad and I got "the merge base xxxx is bad this means the bug was fixed between xxxx and yyyy" :-(
Comment 59 Andy Furniss 2015-04-10 18:43:12 UTC
(In reply to Andy Furniss from comment #56)
> (In reply to Andy Furniss from comment #55)
> > Todays nightly 2015-04-08 locks again.
> 
> > Agressive downclocking on Baytrail
> > 
> > I'll try reverting it and running later.
> 
> Still locks with that reverted.

I tried a bisect again on a different branch, drm-intel-next-queued.

I managed to arrange not to hit any merges, and the bisect pointed at

8fb55197e64d5988ec57b54e973daeea72c3f2ff
drm/i915: Agressive downclocking on Baytrail

In fact, while sitting on that commit, I locked without using kodi for the first time ever, just from the fast scrolling of a "make modules_install" in a maximised xterm.

Generally the locks were much quicker than I am used to - 5-10 mins with kodi.

Just to confuse things, on the older nightly, as I said above, I still locked with this reverted - on the new branch (which has more new commits since I tested the nightly) I so far haven't locked with it reverted.
Comment 60 Maxime Bergeron 2015-04-13 19:23:20 UTC
Q1900DC-ITX here. 
I've been having GPU hangs since 3.19 on kodi/chrome, but they stopped right after I switched to a self-compiled 4.0.0-rc6 kernel from drm-intel-nightly (right before the 70 patch set by Chris Wilson). I can confirm that the newer >RC7 regressed and the GPU hangs "seem" to happen quicker. I also noticed some serious intermittent stuttering on some videos (e.g. CBS.com online) every ~1-2 minutes with the patchset. I can provide logs if required.
Comment 61 Andy Furniss 2015-04-13 20:14:50 UTC
(In reply to Andy Furniss from comment #59)

> Just to confuse things, on the older nightly, as I said above, I still
> locked with this reverted.

I recreated the test on nightly where I thought I still locked with 

8fb55197e64d5988ec57b54e973daeea72c3f2ff
drm/i915: Agressive downclocking on Baytrail

reverted and I didn't lock, so it seems I messed up somewhere for that test initially.

So reverting above alone does make me stable on both the nightly I first tested with and drm-intel-next-queued (tested as it was over the weekend).
Comment 62 Maxime Bergeron 2015-04-15 13:36:51 UTC
(In reply to Andy Furniss from comment #61)
> 8fb55197e64d5988ec57b54e973daeea72c3f2ff
> drm/i915: Agressive downclocking on Baytrail
> 
> reverted and I didn't lock, so it seems I messed up somewhere for that test
> initially.
> 
> So reverting above alone does make me stable on both the nightly I first
> tested with and drm-intel-next-queued (tested as it was over the weekend).

Maybe not. I just tried this (latest drm-intel-next-queued with the commit reverted) and I locked after ~2 hours uptime (couldn't get logs, everything hung up including ssh). Definitely more stable without the commit (less stutter in 1080p video playback), but I had over 50 hours of uptime with 4.0.0-RC6 without any issue. Maybe there's something wrong elsewhere ?
Comment 63 Andy Furniss 2015-04-16 17:16:56 UTC
(In reply to Maxime Bergeron from comment #62)
> (In reply to Andy Furniss from comment #61)
> > 8fb55197e64d5988ec57b54e973daeea72c3f2ff
> > drm/i915: Agressive downclocking on Baytrail
> > 
> > reverted and I didn't lock, so it seems I messed up somewhere for that test
> > initially.
> > 
> > So reverting above alone does make me stable on both the nightly I first
> > tested with and drm-intel-next-queued (tested as it was over the weekend).
> 
> Maybe not. I just tried this (latest drm-intel-next-queued with the commit
> reverted) and I locked after ~2 hours uptime (couldn't get logs, everything
> hung up including ssh). Definitely more stable without the commit (less
> stutter in 1080p video playback), but I had over 50 hours of uptime with
> 4.0.0-RC6 without any issue. Maybe there's something wrong elsewhere ?

Yea, I updated yesterday after seeing this and did manage to lock next-queued.

Possibly not anything recent, though, as it seems whether I lock or not now depends on how I test: 1080i30 (+deint) with some 1080p60 on a 60Hz display = lock. I had been testing before with 1080p24 or 1080i25 and retried like this today; it's still running after 9 hours.

Given the above the next commit I will try reverting in addition to aggressive downclock =

6ad790c0f5ac55fd13f322c23519f0d6f0721864
drm/i915: Boost GPU frequency if we detect outstanding pageflips

and I will run samples where frame/field rate = refresh.
Comment 64 Andy Furniss 2015-04-28 09:39:22 UTC
(In reply to Andy Furniss from comment #63)
> (In reply to Maxime Bergeron from comment #62)
> > (In reply to Andy Furniss from comment #61)
> > > 8fb55197e64d5988ec57b54e973daeea72c3f2ff
> > > drm/i915: Agressive downclocking on Baytrail
> > > 
> > > reverted and I didn't lock, so it seems I messed up somewhere for that test
> > > initially.
> > > 
> > > So reverting above alone does make me stable on both the nightly I first
> > > tested with and drm-intel-next-queued (tested as it was over the weekend).
> > 
> > Maybe not. I just tried this (latest drm-intel-next-queued with the commit
> > reverted) and I locked after ~2 hours uptime (couldn't get logs, everything
> > hung up including ssh). Definitely more stable without the commit (less
> > stutter in 1080p video playback), but I had over 50 hours of uptime with
> > 4.0.0-RC6 without any issue. Maybe there's something wrong elsewhere ?
> 
> Yea, I updated yesterday after seeing this and did manage to lock
> next-queued.
> 
> Possibly not anything recent, though,  as it seems whether I lock or not now
> depends on how I test - 1080i30 (+deint) with some 1080p60 on 60Hz display =
> lock. I had been testing before with 1080p24 or 1080i25 and retried like
> this today - it's still running after 9 Hours.
> 
> Given the above the next commit I will try reverting in addition to
> aggressive downclock =
> 
> 6ad790c0f5ac55fd13f322c23519f0d6f0721864
> drm/i915: Boost GPU frequency if we detect outstanding pageflips
> 
> and I will run samples where frame/field rate = refresh.

Time passes -  I had been slowly trying to find a guilty commit, but I gave up as the history for drm-intel-next-queued looks totally different depending where I am so it's hard to find anything.

I can lock on the commit before Agressive downclocking on Baytrail but not with kodi - the only way I found was "make modules_install" which is quite strange - I made a prog that scrolls at variable rates but that didn't work.

Testing with make while going back through the history didn't get very far. I soon found that the history is inconsistent because of the merges: I would test a commit (git reset --hard), see it fail, look at the history and choose an earlier commit, then find that after resetting to it the history was totally different and I was testing without the commits that "fixed" the issue in the first place

c4d390d drm/i915: Use down ei for manual Baytrail RPS calculations
168ebd7 drm/i915: Improved w/a for rps on Baytrail

even though the previous history/log had them way down after the new place I wanted to try.
Comment 65 Maxime Bergeron 2015-04-28 22:54:38 UTC
(In reply to Andy Furniss from comment #64)
> Trying to test with make going back in the history didn't get very far as I
> soon found that history is inconsistent due to the merges so I would test a
> commit (git reset --hard) fail, look at the history and choose an earlier
> commit then find that when reset on that the history was totally different
> and I was testing without the commits that "fixed" the issue in the first
> place 
> 
> c4d390d drm/i915: Use down ei for manual Baytrail RPS calculations
> 168ebd7 drm/i915: Improved w/a for rps on Baytrail
> 
> even though the previous history/log had them way down after the new place I
> wanted to try.

Yes indeed, it gets complicated with merges.
Personally, if I compile virgin/testing/drm-intel as of today, I get a GPU hang on kodi boot (attached dmesg-4.0.0 and crashlog-4.0.0) with a segmentation fault.
Otherwise, if I revert to before the patchset that includes:

8fb55197e64d5988ec57b54e973daeea72c3f2ff
drm/i915: Agressive downclocking on Baytrail 

It does work. It works with the patchset too, but then it ends up hanging with >=1080p videos. That's weird, as this commit doesn't seem to be linked to the original problem, so it's as if it were simply exacerbating another underlying, older issue that might have been missed. For now I'm running 4.1 from Linus' GitHub tree and it works fine... for now.
Comment 66 Maxime Bergeron 2015-04-28 22:55:20 UTC
Created attachment 115414 [details]
Crashlog GPU Hang on drm-intel-nightly 4.0.0
Comment 67 Maxime Bergeron 2015-04-28 22:55:58 UTC
Created attachment 115415 [details]
Dmesg - drm-intel-nightly 4.0.0
Comment 68 Andy Furniss 2015-05-04 10:22:14 UTC
I tried a kernel.org 4.1-rc1 tar over the weekend and though I didn't lock with kodi, I could quite easily lock with a few "make modules_install" in a row. I do this after kodi has been running some time. Of course my ddx and mesa are new, so likely different to other people's, but I have so far still failed to lock 3.16.7 using the same test method with the same currentish ddx/mesa.
Comment 69 Juergen Froehler 2015-05-05 05:46:50 UTC
Well, I have some free time over the next few days, so I am able to do some tests. Which tree should I use for kernel testing on my device to give the devs useful feedback?
Comment 70 Jesse Barnes 2015-07-29 15:15:58 UTC
Deepak, any update here?
Comment 71 Deepak S 2015-07-29 15:30:02 UTC
Hi Jesse,

I thought improved rps patches from Chris helped us to resolve the issue.

Can someone re-enable legacy turbo and see if it helps?
diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 9baecb7..0dac413 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -4292,12 +4292,7 @@ void intel_irq_init(struct drm_i915_private *dev_priv)
        INIT_WORK(&dev_priv->rps.work, gen6_pm_rps_work);
        INIT_WORK(&dev_priv->l3_parity.error_work, ivybridge_parity_work);
 
-       /* Let's track the enabled rps events */
-       if (IS_VALLEYVIEW(dev_priv) && !IS_CHERRYVIEW(dev_priv))
-               /* WaGsvRC0ResidencyMethod:vlv */
-               dev_priv->pm_rps_events = GEN6_PM_RP_UP_EI_EXPIRED;
-       else
-               dev_priv->pm_rps_events = GEN6_PM_RPS_EVENTS;
+       dev_priv->pm_rps_events = GEN6_PM_RPS_EVENTS;
 
        INIT_DELAYED_WORK(&dev_priv->gpu_error.hangcheck_work,
                          i915_hangcheck_elapsed);


Based on the comments, it looks like we are hitting the issue after enabling aggressive downclocking. I will check the patch again to see if we can find a potential fix.

8fb55197e64d5988ec57b54e973daeea72c3f2ff
drm/i915: Agressive downclocking on Baytrail


Thanks
Deepak
Comment 72 LeKodeur 2015-09-17 07:10:15 UTC
I have been hitting this segmentation fault since a kernel update last week to 3.13.0-64-generic. The seg fault is particularly prevalent under Kodi 15.x when viewing video material. My dmesg: http://pastebin.com/Gc7R4X5u

Installed 'Fritsch' custom kernel (incorporating his 'legacy turbo fix' 3.19.2-legacy+edid+) from his post at http://forum.kodi.tv/showthread.php?tid=238447, and that fixed the issue for me.
Comment 73 Peter Frühberger 2015-09-17 07:16:16 UTC
Wait: a segmentation fault is something completely different from what is discussed in this bug report. From your forum post I figured you get a full freeze of your system. Does it fully freeze or do you get a segfault?

If segfault -> post that log, then your bug is something else.
Comment 74 LeKodeur 2015-09-17 07:46:07 UTC
My apologies for my use of incorrect terminology ... yes, 'full system freezes' was the term I should have used!
Comment 75 Maxime Bergeron 2015-09-22 16:50:23 UTC
My comment won't be very helpful... I tried kernel 4.1.6 from kernel.org and it doesn't freeze, but kernel 4.2 does (both self-compiled). I then tried the legacy turbo patch on 4.2 and, although it seems to last longer, it does end with a full system freeze during video playback. Both were used on the same Baytrail i915 system with nightly mesa/drivers.
Comment 76 Anael 2015-09-22 17:37:53 UTC
@Maxime Bergeron:
It is helpful, I was recently wondering if I should switch back to the latest kernel on my Archlinux. Now I know I should better stick with the 3.14 LTS. Thanks.
Comment 77 Yichao Zhou 2015-09-22 21:35:58 UTC
I can confirm that the "legacy turbo" patch doesn't work for me either. Using that patch with ck-kernel, my system still freezes frequently under high CPU load. The last kernel that works for me is the 3.18.x branch.
Comment 78 benno 2015-10-06 14:54:49 UTC
Hi, I can confirm the freeze on a BYT as well (Zotac Pico 320, Intel Atom Z3735F). I first used the patched kernel 3.19.1 (with the legacy turbo fix) and got rare random freezes, about one per week, but after some updates I got several freezes a day again. I then used the patched kernel 3.19.2 (with legacy turbo fix + edid) and it freezes less, but still 3-4 times a week. Thanks for your great work on this!
Comment 79 viktor 2015-10-08 06:44:31 UTC
With kernel 4.1, the system was relatively stable. I would probably get 1 freeze in a couple of weeks. Upgrading to 4.2 causes massive freezes, both during playback and if kodi just shows its first screen. Freezes happen within an hour of playback, and maybe within 12 if nothing is being played.
Comment 80 Wouter Driessen 2015-10-17 12:22:32 UTC
I'll just chime in, as I also notice a huge difference between kernel 4.1 and 4.2. On my Shuttle XS36V4 (Celeron J1900), running arch completely up-to-date, I can hardly watch 15 minutes of video before it all locks up. After reverting the kernel back to 4.1.6-1, the system is quite stable again.
Comment 81 Andy Furniss 2015-10-19 10:18:37 UTC
My Asrock Q1900DC was originally bought to be a headless router/pvr/nas which it now is - so no more testing of this lock from me for a long time (or so I thought).

When putting it to its new duties I put a vanilla 4.1.1 on it (didn't patch as being headless I don't get any i915 interrupts). All was good - uptime 100 days varied CPU loads no issues.

USB has some xhci isoc pstate issues which were worked around by disabling USB3 in bios to force ehci driver. This issue was low level packet loss from dvb tuners not locks.

Recently needed to re-locate and while doing so updated to 4.1.10 = hard lock after 7 days uptime. The kernel was not the only difference as I attached a usb printer and so have usb module and cups running now, though the printer had been off for days when it locked.

Anyway I am back on 4.1.1 now (with printer) and will have to see how long it lasts to be sure whether the kernel or the printer (or the move!) was the cause.
Comment 82 Michal Feix 2015-10-21 17:59:27 UTC
I'm seeing very similar symptoms on my Celeron N2940 system. I'm using the Arch distro with kernel 4.2.3-1. The system freezes from time to time when playing videos, especially when using HW acceleration. It usually happens when playing videos from mpv with VAAPI HW decoding enabled or flash videos in Firefox with HW decoding enabled in the flash config file. The system usually freezes within 10 to 30 minutes of playback. Playing videos with no HW decoding means fewer freezes, but doesn't avoid them. They just happen less frequently, and from time to time even when no video playback is running in my LXDE.

My best guess is that this behavior started somewhere between kernel versions 4.0.7 and 4.1.6. Unfortunately, I can't be more specific, as it took me more than two months of stress testing CPU and memory before I pinpointed this problem as most probably connected with heavy GFX usage.

I already tried a few options with no luck, like i915.reset=0/1, i915.enable_rc6=0/1 and i915.semaphores=0/1. I couldn't feel any difference, except with enable_rc6=9. The system was even less stable then.

Using drm.debug=1 did not produce any interesting messages before the freeze. The log is filled with "random" I915_GEM_BUSY, I915_GEM_EXECBUFFER2 and I915_GEM_MADVISE messages up to the freeze. Of course, I can post the log if someone feels it's interesting anyway.

I'm willing to offer more help with debugging this issue.
Comment 83 ladiko 2015-10-21 19:21:22 UTC
Unrelated to this bug but people who return to 3.16 or 3.13 on Ubuntu may use 3.13.0-65 or 3.16.0-50 due to this bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1503655
Comment 84 ladiko 2015-10-22 05:41:06 UTC
I just saw that the other issue seems to be fixed recently, sorry for the disturbance. Just interesting how long this bug exists unfixed while the other one has been fixed quite fast.
Comment 85 Vladimir Jicha 2015-10-22 08:18:21 UTC
I also don't understand why such a serious bug hasn't been fixed yet. Does anybody at Intel even care about it?
Comment 86 sagir3 2015-10-26 03:42:55 UTC
Also affecting Debian Jessie's stock kernel, 3.16.0-4-amd64.

I am running on a Thinkpad T430 with i5-3320M 2.6GHz Ivybridge CPU. Under sustained load, the whole computer will freeze (no ssh, no keyboard inputs, no nothing), within a period of 4 to 4.5 hours.

Using YouTube or running any sort of video media will escalate the problem, however, it can freeze up randomly, between 30 mins to over 6 hours.

After compiling kernel 3.15.10 from kernel.org, the issue is gone. This seems to have started only from version 3.16 and onwards. Version 4.2.3, the latest kernel, STILL has not fixed this issue.

Intel, please do something about this. Some people might have need for the latest kernel (I do not, at the moment, but I'd rather not stick with an outdated kernel).
Comment 87 Deepak S 2015-10-26 03:47:30 UTC
@Jesse, Shall we enable legacy turbo on BYT until we have rootcause on BYT WA?
Also, Chris has submitted a cleaned up patch for "WA for Turbo and RC6 to work together" for review.
Comment 88 Michal Feix 2015-10-27 22:15:38 UTC
Just a quick confirmation.

I haven't seen a single freeze in more than 4 hours of watching video since I started using the kernel option intel_pstate=disable.
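
To check whether that option is actually on the running kernel's command line, something like the following could be used. It is only a sketch: `pstate_disable_requested` is a made-up helper name, and it matches the exact spelling intel_pstate=disable used above, so any other spelling is reported as not set.

```shell
# Print "yes" if the given kernel command line contains the exact token
# intel_pstate=disable, otherwise print "no".
# Hypothetical helper for illustration; it only inspects the string it is given.
pstate_disable_requested() {
    for tok in $1; do
        if [ "$tok" = "intel_pstate=disable" ]; then
            echo yes
            return 0
        fi
    done
    echo no
}
```

On a live system it would be called as `pstate_disable_requested "$(cat /proc/cmdline)"`.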
Comment 89 ladiko 2015-10-28 07:37:54 UTC
And how much warmer does the CPU get?
Comment 90 Michal Feix 2015-10-28 15:37:43 UTC
It is quite hard for me to compare. Before I found this "magic" kernel parameter, my notebook usually froze before the CPU could get any warmer. Since yesterday, I haven't seen more than 45C on any core while working in office apps or watching a movie. I guess this temperature is not an issue on a Celeron N2940.

Anyway - if I have to choose between reliable working notebook with a bit warmer CPU and randomly freezing notebook with calm CPU, I choose the first one for sure. ;-)
Comment 91 ladiko 2015-10-28 16:22:41 UTC
actually you have three options:

* current kernel --> freezes
* current kernel + pstates-parameter -> warmer cpu
* kernel 3.16 --> no issues
Comment 92 Chris Rainey 2015-10-29 21:47:53 UTC
3.16 working well on my DELL Inspiron 3646:

I've had little to no trouble ... even stressing the system using:

glmark2 --run-forever


I got my 3.16 kernel here:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16.7-ckt18-utopic/


I'm currently using Ubuntu 15.10 with the 3.16 kernel.


Hope this helps !!
Comment 94 Kamil 2015-10-30 06:46:18 UTC
Currently I use kernel 3.16.0-4 (Debian Jessie default) and since that change (before I had kernel 4.2) I do not experience any system freeze. For now my computer (ASROCK SBC-211P, BYT CPU) is working second day without any crash. Before, I had Kernel 4.2 on which I had system freeze after 2min from system boot and playing video in VLC.
Comment 95 Tim 2015-10-30 13:29:34 UTC
Me too... random freezes with Gentoo on a Biostar J1900MH2.  Used as a HTPC/mythfrontend, so any kernel too old to provide audio over HDMI is not OK.  I have been testing different BIOS settings and kernel configs.  Currently running 4.2.4
Comment 96 John 2015-10-30 18:52:56 UTC
Same here with random freezes.  Tried intel_pstate=disabled which works.  However, cutting the max GPU frequency to about 50% also works for me.  Video seems smoother compared to pstate=disabled. YMMV

Running 64 bit Mint 17.2/Cinnamon on ASUS T100-CHI linux-4.2.5 w/Ubuntu base and T100 specific patches.  (Intel Atom Z3775, ValleyView Gen7)

Also seeing freezes on Dell Inspiron Laptop (Intel N3540) with various Ubuntu kernels from 4.3-rc7 back to 3.18.21, though usually much less than once a day.  

CHI without workarounds usually freezes within minutes to several hours.  With GPU capped, it runs as long as I let it - usually a few days.
Comment 97 ladiko 2015-10-30 18:55:29 UTC
How to limit the GPU frequency?
Comment 98 John 2015-10-30 18:59:45 UTC
Same here with random freezes.  Tried intel_pstate=disabled which works.  However, cutting the max GPU frequency to about 50% also works for me.  Video seems smoother compared to pstate=disabled. YMMV

Running 64 bit Mint 17.2/Cinnamon on ASUS T100-CHI linux-4.2.5 w/Ubuntu base and T100 specific patches.  (Intel Atom Z3775, ValleyView Gen7)

Also seeing freezes on Dell Inspiron Laptop (Intel N3540) with various Ubuntu kernels from 4.3-rc7 back to 3.18.21, though usually much less than once a day.  

CHI without workarounds usually freezes within minutes to several hours.  Software rendering improves run time/reduces freeze rate.  However, with GPU capped, it runs as long as I let it - usually a few days.
Comment 99 John 2015-10-30 19:09:39 UTC
To cap frequency I read the max (779 for mine) from 

cat /sys/class/drm/card0/gt_max_freq_mhz

To set it, pick a lower value and write it (as root):

echo 423 > /sys/kernel/debug/dri/0/i915_max_freq

I tried lower values in 100 MHz steps until I found stability (at 423 in my case).

I think you could just put back to the gt_max but this worked for me.

This value resets each boot, and the driver rounds the value to something close.
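
The stepping procedure above can be sketched as a small shell helper. This is only an illustration: `freq_steps` and `apply_cap` are hypothetical names, the floor value is an assumption you would pick for your own GPU, and the debugfs path is the one quoted in this comment (it needs root and a mounted debugfs).

```shell
# Print candidate caps in MHz, stepping down from max_mhz by step_mhz and
# stopping before going below floor_mhz. Pure arithmetic, touches no files.
freq_steps() {
    max_mhz=$1
    step_mhz=$2
    floor_mhz=$3
    out=""
    cur=$((max_mhz - step_mhz))
    while [ "$cur" -ge "$floor_mhz" ]; do
        out="${out:+$out }$cur"
        cur=$((cur - step_mhz))
    done
    printf '%s\n' "$out"
}

# Hypothetical helper: write one cap to the debugfs file quoted above.
# Run as root; the driver may round the value to a nearby supported frequency.
apply_cap() {
    echo "$1" > /sys/kernel/debug/dri/0/i915_max_freq
}

# Example: freq_steps 779 100 400   prints "679 579 479"
```

Each value from `freq_steps` would be passed to `apply_cap` in turn and the system tested for a while before moving on to the next lower one. Because the cap resets at boot, it has to be applied again after every reboot.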
Comment 100 Tim 2015-11-01 15:55:24 UTC
I have found a setting that controls the random freezes, at least on my board: disabling "IGD Turbo Enable" under NorthBridge options in the BIOS. Otherwise, the BIOS is set to the defaults.

Different kernel .configs had no effect.  I have enabled all Baytrail options and boot from an EFI stub.
Comment 101 Luka Karinja 2015-11-02 08:40:43 UTC
Lowering i915_max_freq, even setting it to min still freezes my T100TAF (Atom Z3735).
I haven't experienced any freezes with pstate=disabled, but performance is really affected
Comment 102 John 2015-11-02 19:01:12 UTC
Given Luka Karinja's results, I checked my kernel args to see if something else could account for my results.  I found - i915.i915_enable_rc6=1 i915.lvds_downclock=1 i915.semaphores=1 i915.i915_enable_fbc=1.

rc6=1 seems to be known to add instability; perhaps the freq cap offset that. I've stripped the args (except boot, splash, quiet) and will be running new tests.

Kernel args I've been using the last few weeks on T100CHI.

boot=pci,force acpi=force rcutree.rcu_idle_gp_delay=1 libahci.ignore_sss=1 splash quiet acpi_enforce_resources=lax i915.i915_enable_rc6=1 i915.lvds_downclock=1 i915.semaphores=1 i915.i915_enable_fbc=1 drm.vblankoffdelay=1 pcie_aspm=force acpi=force rcutree.rcu_idle_gp_delay=1 libahci.ignore_sss=1 splash quiet acpi_enforce_resources=lax drm.vblankoffdelay=1 pcie_aspm=force
Comment 103 ladiko 2015-11-02 19:32:15 UTC
I still get rare freezes on Ubuntu with linux-image-generic-lts-utopic (kernel 3.16.0). Does pstates=disabled only affect Intel CPUs or AMD ones as well? I am searching for a general setup that doesn't affect AMD CPUs, only Intel Baytrail.
Comment 105 Jani Nikula 2015-11-03 11:13:57 UTC
(In reply to John from comment #102)
> Given Luka Karinja's results, I checked my kernel args to see if something
> else could account for my results.  I found - i915.i915_enable_rc6=1
> i915.lvds_downclock=1 i915.semaphores=1 i915.i915_enable_fbc=1.

i915.i915_enable_rc6 and i915.i915_enable_fbc have been renamed i915.enable_rc6 and i915.enable_fbc, respectively, since v3.15 so those have had no impact.

These days all of those are considered debug options, and we taint the kernel if they've been set.
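
To spot the old spellings on a command line, a rough sketch along these lines might help. `find_renamed_i915_args` is a hypothetical helper name, and it only knows the two renames mentioned in this comment.

```shell
# Print each obsolete i915.i915_enable_* token found in a kernel command line,
# followed by the renamed form given in this comment.
# Hypothetical helper for illustration; it only inspects the string it is given.
find_renamed_i915_args() {
    for tok in $1; do
        case "$tok" in
            i915.i915_enable_rc6=*) echo "$tok -> i915.enable_rc6=${tok#*=}" ;;
            i915.i915_enable_fbc=*) echo "$tok -> i915.enable_fbc=${tok#*=}" ;;
        esac
    done
}
```

Running `find_renamed_i915_args "$(cat /proc/cmdline)"` prints nothing when none of the old names are present.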
Comment 106 John 2015-11-03 20:14:28 UTC
(In reply to Jani Nikula from comment #105)
<snip>
> i915.i915_enable_rc6 and i915.i915_enable_fbc have been renamed
> i915.enable_rc6 and i915.enable_fbc, respectively, since v3.15 so those have
> had no impact.
> 
> These days all of those are considered debug options, and we taint the
> kernel if they've been set.

Appreciate the info.  Retested: no args, no cap -> froze < 2 hours, reboot froze within 2 minutes. Frequency cap only, still running (25+ hrs.)

But it looks like I've been just rehashing comments 33-36, which also didn't work for everyone. Only difference is 50% cap vs. minimum cap. Improvement?
Comment 107 Kamil 2015-11-04 07:33:48 UTC
Every kernel above 3.16.x just fails.

3.16.x - no freeze
above 3.16.x - freezes no later than six hours after video launch.

I checked many kernel versions: 3.16.x, 3.17.x, 3.18.x, 3.19.x, 4.0.x, 4.1.x, 4.2.x and the latest 4.3. None of the kernel parameters described above works.

For tests I used ASROCK SBC-211P (Baytrail-E3800).
Comment 108 Laszlo Fiat 2015-11-08 10:44:36 UTC
(In reply to John from comment #99)
> To cap frequency I read the max (779 for mine) from 
> 
> cat /sys/class/drm/card0/gt_max_freq_mhz
> 
> To set pick a lower value (as root)
> 
> echo 423 > /sys/kernel/debug/dri/0/i915_max_freq

I have a Z3735F baytrail tablet running Debian 8 with a 1 month old linux-next kernel. 

I've lowered the i915_max_freq to 345 MHz, and achieved stability that way. 
No freezes since then. The Z3735F GPU has a base freq of 311 MHz, so I am pretty close to that.

I have also patched the kernel source with a few baytrail sdhci related patches from: https://github.com/hadess/rtl8723bs/tree/master/patches
Comment 109 Andy Furniss 2015-11-08 11:06:24 UTC
(In reply to Andy Furniss from comment #81)

<snip>

> Recently needed to re-locate and while doing so updated to 4.1.10 = hard
> lock after 7 days uptime. The kernel was not the only difference as I
> attached a usb printer and so have usb module and cups running now, though
> the printer had been off for days when it locked.
> 
> Anyway I am back on 4.1.1 now (with printer) and will have to see how long
> it lasts to be sure whether the kernel or the printer (or the move!) was the
> cause.

Still up OK after 20 days back on 4.1.1.

Strange that 4.1.10 seems to be a regression; there don't seem to be any obvious power-related i915 commits between the two. Though as I am headless I am not getting any i915 interrupts anyway, which makes me think that there is some different CPU/IO-related regression. In all the testing I did before when using the GPU, I never locked by just stressing CPU/IO, until maybe just before I stopped testing, when I could get "make modules_install" to reliably lock (as noted in a previous comment).
Comment 110 Michal Feix 2015-11-08 23:20:44 UTC
(In reply to Andy Furniss from comment #109)
> (In reply to Andy Furniss from comment #81)
> 
> Still up OK after 20 days back on 4.1.1.
> 
> Strange that 4.1.10 seems to be a regression, there don't seem to be any
> obvious power related i915 commits between the two. Though as I am headless
> I am not getting and i915 interrupts anyway, which makes me thing that there
> is some different CPU/IO related regression. In all the testing I did before
> when using GPU I never locked by just stressing CPU/IO until maybe just
> before I stopped testing when I could get "make modules_install" to reliably
> lock (as noted in a previous comment).

To make it even stranger: as I reported earlier, on kernel 4.2.3 my system was unusable. I've downgraded to LTS kernel 4.1.12 and have not had a single issue since then. I've been running 4.1.12 successfully for more than a week now without a single freeze. I don't even need the pstate=disable command arg any more, which was necessary on 4.2.3 to survive more than a few minutes. I haven't tested 4.1.10 though.
Comment 111 John 2015-11-11 02:42:50 UTC
The notes for 4.2.6 claim to fix one problem that causes GPU locks.  When I added the incremental patch set, the longest it ran was about an hour (usually it froze within 5 minutes.)  I had just stopped a 6 day run (24/7) on my (ASUS baytrail) T100 specific 4.2.5 kernel (no args, 50% GPU cap) (with sdhci patches)  The freezes in 4.2.6 now seem to be independent of GPU frequency for my setup.
Comment 112 swex 2015-11-11 17:45:07 UTC
I've got freezes on baytrail tablet ASUS Vivotab note 8 (m80ta). But for me it looks unrelated to i915. Even with nomodeset and rmmod i915 system hang after some random time. From minutes to several hours.
Comment 113 cffwet 2015-11-13 03:19:03 UTC
I have system freezes on ASRock Q1900-ITX with a kernel 3.19.31-generic on an Ubuntu distro. I upgraded to kernel 4.2.0-16-generic last month and recently to 4.2.0-18-generic. The system freezes got worse (less than 10 min watching videos).

I disabled hardware acceleration in all software that has this option, such as my browsers. Further, I edited the file /etc/default/acpi-support: I disabled suspend/hibernate handling in acpi-support by changing the line SUSPEND_METHODS="dbus-pm dbus-hal pm-utils" to SUSPEND_METHODS="none".

I don't get any freezes anymore, now for 24h on both kernels 3.19.31-generic and 4.2.0-18-generic with a lot of video playing. I didn't test kernel 4.2.0-16-generic.

I tested disabling hardware acceleration without changing the acpi-support file. And I tested disabling suspend/hibernate handling with hardware acceleration. In both cases I still got freezes but it seems less frequent. I needed both options disabled to get rid of all the freezes.
Comment 114 carl wolfgang 2015-11-17 22:09:49 UTC
On a Zotac CI320 nano with Ubuntu Trusty server 14.04.3 LTS, the 3.19.1-legacy-turbo+ kernel from the OpenELEC forum, yavdr unstable installed and va-api-glx in the softhddevice VDR plugin, a kernel oops left the following trace. It may be useful because freezes normally don't leave a trace in the logs.

Nov 17 22:12:55 nano4 kernel: [ 4740.991238] ------------[ cut here ]------------
Nov 17 22:12:55 nano4 kernel: [ 4740.991365] WARNING: CPU: 3 PID: 134 at drivers/gpu/drm/i915/intel_pm.c:4492 valleyview_set_rps+0x167/0x1a0 [i915]()
Nov 17 22:12:55 nano4 kernel: [ 4740.991375] WARN_ON(val > dev_priv->rps.max_freq_softlimit)
Nov 17 22:12:55 nano4 kernel: [ 4740.991383] Modules linked in: msr(E) autofs4(E) rc_tt_1500(OE) ts2020(OE) m88ds3103(OE) i2c_mux(E) arc4(E) intel_rapl(E) intel_powerclamp(E) snd_hda_codec_hdmi(E) snd_hda_codec_realtek(E) snd_hda_codec_generic(E) kvm(E) crct10dif_pclmul(E) crc32_pclmul(E) dvb_usb_dw2102(OE) dvb_usb(OE) ghash_clmulni_intel(E) iwlmvm(E) cryptd(E) dvb_core(OE) snd_soc_rt5640(E) mac80211(E) media(OE) snd_hda_intel(E) snd_soc_rl6231(E) snd_hda_controller(E) snd_intel_sst_acpi(E) snd_intel_sst_core(E) snd_soc_sst_mfld_platform(E) snd_hda_codec(E) snd_soc_core(E) serio_raw(E) snd_compress(E) iwlwifi(E) btusb(E) snd_pcm_dmaengine(E) snd_hwdep(E) cfg80211(E) snd_pcm(E) snd_seq_midi(E) snd_seq_midi_event(E) ir_lirc_codec(OE) ir_xmp_decoder(OE) lirc_dev(OE) ir_mce_kbd_decoder(OE) mei_txe(E) iosf_mbi(E) ir_sharp_decoder(OE) mei(E) lpc_ich(E) shpchp(E) ir_sanyo_decoder(OE) snd_rawmidi(E) ir_sony_decoder(OE) ir_jvc_decoder(OE) ir_rc6_decoder(OE) ir_rc5_decoder(OE) snd_seq(E) ir_nec_decoder(OE) snd_seq_device(E) snd_timer(E) rc_rc6_mce(OE) nuvoton_cir(OE) rc_core(OE) 8250_fintek(E) snd(E) rfcomm(E) bnep(E) dw_dmac(E) dw_dmac_core(E) i2c_hid(E) hid(E) rfkill_gpio(E) soundcore(E) bluetooth(E) snd_soc_sst_acpi(E) 8250_dw(E) spi_pxa2xx_platform(E) i2c_designware_platform(E) i2c_designware_core(E) pwm_lpss_platform(E) mac_hid(E) pwm_lpss(E) i915(E) video(E) drm_kms_helper(E) nfsd(E) drm(E) auth_rpcgss(E) nfs_acl(E) i2c_algo_bit(E) nfs(E) lockd(E) grace(E) sunrpc(E) fscache(E) nct6775(E) hwmon_vid(E) coretemp(E) lp(E) parport(E) nls_iso8859_1(E) psmouse(E) r8169(E) mii(E) ahci(E) libahci(E) sdhci_acpi(E) sdhci(E)
Nov 17 22:12:55 nano4 kernel: [ 4740.991767] CPU: 3 PID: 134 Comm: kworker/3:2 Tainted: G           OE  3.19.1-legacy-turbo+ #1
Nov 17 22:12:55 nano4 kernel: [ 4740.991778] Hardware name: Motherboard by ZOTAC ZBOX-CI320NANO series/ZBOX-CI320NANO series, BIOS B219P026 05/19/2015
Nov 17 22:12:55 nano4 kernel: [ 4740.991859] Workqueue: events intel_gen6_powersave_work [i915]
Nov 17 22:12:55 nano4 kernel: [ 4740.991871]  ffffffffc06cb3c8 ffff88003655fcc8 ffffffff8179acb0 0000000000000000
Nov 17 22:12:55 nano4 kernel: [ 4740.991890]  ffff88003655fd18 ffff88003655fd08 ffffffff81073a7a ffff88003655fcf8
Nov 17 22:12:55 nano4 kernel: [ 4740.991908]  ffff880078550000 00000000000000d6 00000000000000d6 ffff880077acd000
Nov 17 22:12:55 nano4 kernel: [ 4740.991927] Call Trace:
Nov 17 22:12:55 nano4 kernel: [ 4740.991968]  [<ffffffff8179acb0>] dump_stack+0x45/0x57
Nov 17 22:12:55 nano4 kernel: [ 4740.991993]  [<ffffffff81073a7a>] warn_slowpath_common+0x8a/0xc0
Nov 17 22:12:55 nano4 kernel: [ 4740.992013]  [<ffffffff81073af6>] warn_slowpath_fmt+0x46/0x50
Nov 17 22:12:55 nano4 kernel: [ 4740.992111]  [<ffffffffc0620467>] valleyview_set_rps+0x167/0x1a0 [i915]
Nov 17 22:12:55 nano4 kernel: [ 4740.992202]  [<ffffffffc0621ecf>] intel_gen6_powersave_work+0xb2f/0x11b0 [i915]
Nov 17 22:12:55 nano4 kernel: [ 4740.992223]  [<ffffffff8108c6cf>] process_one_work+0x14f/0x400
Nov 17 22:12:55 nano4 kernel: [ 4740.992241]  [<ffffffff8108ce68>] worker_thread+0x118/0x510
Nov 17 22:12:55 nano4 kernel: [ 4740.992259]  [<ffffffff8108cd50>] ? rescuer_thread+0x3d0/0x3d0
Nov 17 22:12:55 nano4 kernel: [ 4740.992278]  [<ffffffff81092252>] kthread+0xd2/0xf0
Nov 17 22:12:55 nano4 kernel: [ 4740.992298]  [<ffffffff81092180>] ? kthread_create_on_node+0x180/0x180
Nov 17 22:12:55 nano4 kernel: [ 4740.992319]  [<ffffffff817a26fc>] ret_from_fork+0x7c/0xb0
Nov 17 22:12:55 nano4 kernel: [ 4740.992339]  [<ffffffff81092180>] ? kthread_create_on_node+0x180/0x180
Nov 17 22:12:55 nano4 kernel: [ 4740.992352] ---[ end trace 6a13023d6ab83790 ]---
Comment 115 peppedx 2015-11-19 15:20:34 UTC
It also happens to me (almost once a day) on a fresh Ubuntu 15.10

-> Atom(TM) CPU E3845 @ 1.91GHz

-> Linux rehab-desktop 4.2.0-18-generic #22-Ubuntu SMP Fri Nov 6 18:25:50 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

-> Intel® Graphics Stack Release 2015Q3 for Linux*

But it also happens on Mint 17.2 (14.04 based) with 3.16 and 3.19 kernels, using either the SNA or UXA accel method.
Comment 116 Juha Sievi-Korte 2015-11-21 10:02:12 UTC
Hi All,

Came across this when hunting random freezes / crashes on Acer B115 laptop. It started with upgrade to ubuntu 15.04 (14.x worked ok, haven't noted the kernel versions).

http://ubuntuforums.org/showthread.php?t=2284615&p=13313066#post13313066 My original post in here.

OpenSuSE with kernel 4.0.5 seemed to run fine, but it might be that I looked at it from the wrong end, because 15.04 ubuntu crashed only when going to sleep so that is the thing I tried to track down. 15.10 now crashes randomly during desktop use - and the same happens in OpenSuSE Tumbleweed with the 4.3 kernel.

Crashes seem intermittent: it might go days without a freeze, and then a couple of nights back there were two freezes in a row, the second one just a couple of minutes after reboot. The only load was chromium showing a couple of large web pages when the crashes happened. Symptoms are much the same as described in many posts: no sysrq possible, only power off works.

I did already try intel_pstate=disable and that made the system freeze on screensaver after just few minutes of uptime. After that I've booted with debugging options enabled and fiddled a bit with clock frequency setting, and haven't managed to crash since - but I'm still only three days up. Tried to make it crash by playing couple of games and/or HD videos, no luck so far. But this is to be expected, 15.10 ubuntu could also run couple of weeks - which makes this painful as there seems to be no clear way of reproducing the issue.

It just makes me wonder whether there is something going on with timings at the hardware level. What I tried was lowering the frequency setting just slightly; in quick testing it didn't seem to matter how much I touched it. Also I'm a bit puzzled about the setting: is the /sys/kernel/debug/dri/0/i915_max_freq value in MHz or something else? In the log it says:

[26873.155419] [drm:valleyview_enable_rps] current GPU freq: 312 MHz (198)
[26873.155420] [drm:valleyview_enable_rps] setting GPU freq to 645 MHz (214)

And I think I saw these high values in the log even after I set the frequency value to less than 400. Anyway, I'll update if I find anything else; this is annoying, as it has been going on for months now without a clear clue as to what is wrong with this laptop :)
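
On the MHz question: each of the two log lines above prints the frequency twice, once in MHz and once as a number in parentheses. That second number looks like the raw value the driver programs, but that is an assumption from the log format, not something confirmed in this thread. A minimal Python sketch that pulls both numbers out of such lines, so they can be compared against whatever was written to i915_max_freq (the helper name and regex are hypothetical, based only on the two lines quoted above):

```python
import re

# Matches lines such as:
#   [26873.155419] [drm:valleyview_enable_rps] current GPU freq: 312 MHz (198)
#   [26873.155420] [drm:valleyview_enable_rps] setting GPU freq to 645 MHz (214)
# Hypothetical pattern, derived only from the two log lines quoted in this comment.
FREQ_RE = re.compile(r"GPU freq(?: to)?:?\s*(\d+) MHz \((\d+)\)")


def parse_gpu_freq_lines(log_text):
    """Return a list of (mhz, raw_value) tuples found in dmesg text."""
    return [(int(m.group(1)), int(m.group(2))) for m in FREQ_RE.finditer(log_text)]
```

Feeding it the output of dmesg after changing the setting would show whether the MHz figure or the parenthesised figure tracks the value that was written.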
Comment 117 John 2015-11-25 21:07:00 UTC
I've been running several days without a freeze on my 4.2.6 kernel.  I simply added intel_idle.max_cstate=1 to my kernel arguments, no other power arguments, and no more setting GPU frequency caps.  

intel_idle.max_cstate=0 was effective too, but my system ran warm (not hot) when idle.  At max_cstate=1 the case temperature seems normal to me.

I suspect that the cost of this work-around would be less battery run time.  But until the T100CHI has full hardware support in linux (no sound, no bluetooth...), I'm tethered to a powered hub anyway.

I've also tested versions of 4.1.13, 4.2.6, 4.3, even 4.4-rc1 without obvious side-effects.  4.4rc2 did freeze within minutes of booting, but 4.4-rcx has too many regressions (no wifi even on a dongle) to take that freeze seriously.

I also tried max_cstate=2 on my Dell laptop (baytrail) but that seemed to trigger a "not quite" freeze during a kernel build (fan speed malfunction typical of a freeze, but the build finished successfully.)  The subsequent power down crashed and the next boot was extremely difficult to start (press hold repeat).  I'm not going to try the remaining max cstates 3-6!  

This might suggest the freeze lies in handling cstates 2-6 starting after kernel-3.16.7.  But that assumes this bandaid lasts more than another week.
Comment 118 John 2015-11-30 19:33:35 UTC
(In reply to John from comment #117)
> I've been running several days without a freeze on my 4.2.6...<snip>.. 
>
Update:  I found info suggesting cstate limits of 0,1 & 6(default) are valid, maybe 3, but probably not 2.  
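
To see which idle states the kernel actually exposes on a given machine, the cpuidle sysfs directory can be listed. A minimal Python sketch, assuming the standard /sys/devices/system/cpu/cpu0/cpuidle/stateN/name layout; the function name is mine, and note that the sysfs state index is not necessarily the same number that intel_idle.max_cstate takes:

```python
import os


def list_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(index, name), ...] for each stateN directory, sorted by index."""
    states = []
    for entry in os.listdir(cpuidle_dir):
        suffix = entry[len("state"):]
        # Skip anything that is not a stateN directory.
        if not entry.startswith("state") or not suffix.isdigit():
            continue
        with open(os.path.join(cpuidle_dir, entry, "name")) as f:
            states.append((int(suffix), f.read().strip()))
    return sorted(states)
```

Comparing the list with and without the max_cstate argument shows which states the limit removes.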

I had to boot my CHI into the OEM.  When I resumed linux, I omitted the cstate kernel argument, as a sanity check.  My 4.2.6 froze within 5 minutes (browsing internet eagle cam).  Otherwise, still no freezes when I set intel_idle.max_cstate=1.  (~10 days so far)  

It looks like I can reproduce one type of freeze readily, so if y'all have [baytrail cstate management] 4.2.x patches to beta test, let me know. I can also test 4.1.1x or 4.3.x patches but those freeze rates are less "dependable."  I won't test 4.4-rcx until wifi (or USB wifi dongle) starts working again in the stock kernel.
Comment 119 Chris Rainey 2015-12-03 16:43:47 UTC
Confirming that "intel_idle.max_cstate=1" has solved my complete freeze issues on Bay Trail running Linux 4.1.13(Slackware64-current(pre-4.2) formerly running Ubuntu 15.04/15.10 with stock kernels).

Thanx for all the hard-work and long-efforts to see this through!
Comment 120 Martin W. 2015-12-04 21:29:18 UTC
I can also confirm that "intel_idle.max_cstate=1" has solved my complete freeze issues on Bay Trail (Celeron J1900) running Linux 4.2.5 (Arch Linux).

Before, I got complete freezes when playing video using Kodi or VLC, browsing using Chrome etc. Freezes happened randomly: sometimes within 5 minutes of boot, other times the computer would be stable for hours.

With "intel_idle.max_cstate=1" the computer has been stable for more than two days straight now playing videos, music, browsing using Chrome, playing some games etc.

Thanks John for the tip!
Comment 121 ladiko 2015-12-05 08:54:50 UTC
I tried all ubuntu 14.04 LTS kernels from 3.13 over 3.16, 3.19 to 4.2 and got freezes with all of them except for 3.13. All kernels that produced freezes were tried with all of the mentioned kernel parameters, which I verified with cat /proc/cmdline. Kernel 4.2 + intel_idle.max_cstate=1 froze within 1 day.

We are running almost 200 machines with an identical setup of ubuntu 14.04 + xfce4 + chromium + html5-kiosk web application which includes an ogm video which is played when idle and otherwise some hardware accelerated html5 animations. 50 of the machines are powered by a Celeron J1900; the rest are equipped with older Core 2 Duo / Pentium Dual-Core or Celeron 847 and ~20 AMD E1-2100 or A4-5000. The most stable kernel for us is the default Ubuntu 14.04 kernel 3.13. We're going to buy AMD Kabinis as we don't have any issues there except the higher TDP and higher temperatures in a completely passively cooled system.
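
For anyone who wants to script that /proc/cmdline check across many machines, here is a minimal Python sketch. The function names are hypothetical; it only splits the command line on whitespace and does not handle quoted values:

```python
def kernel_param(cmdline, name):
    """Return the value of `name` from a kernel command line string.

    Returns None if the parameter is absent and "" if it is present
    without a value. If it appears more than once, the last occurrence
    is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if key == name:
            value = val if sep else ""
    return value


def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line."""
    with open(path) as f:
        return f.read()
```

Usage would be along the lines of kernel_param(read_cmdline(), "intel_idle.max_cstate"), which gives "1" when the workaround is active on that boot.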
Comment 122 John 2015-12-06 08:41:41 UTC
I was surprised to experience a freeze while running Android_x86 4.4-rc3 on my 2 in 1 laptop.  After digging a bit - I found that the android_x86 runs on a custom linux-4.0.8. There wasn't a cstate argument in the command line.  Too soon to know if it will help, but I no longer get the "unfortunately," my app "has stopped running" warning when I try to launch an app with wifi off.

As ladiko points out, it is curious that AMD machines seem to be exempt from these freezes.  I have a dual boot AMD laptop mainly running Mint (linux 3.16.0-38-generic) for about 6 months.  The only problems I've had with it were related to the old hard drive starting to fail.  The kernel might be too old to freeze, though.
Comment 123 Peter Frühberger 2015-12-06 08:46:26 UTC
This bug has nothing to do with AMD machines ... that's just noise. It's still the same for everyone. Forcing the kernel to max cstate 1 or setting that via the bios solves the issue reliably.

We have some good experience with: https://github.com/fritsch/OpenELEC.tv/blob/jarvis-egl/packages/linux/patches/4.3/linux-999-i915-use-legacy-turbo.patch

Besides that - this bug got really, really silent concerning fixes.
Comment 124 ceric 2015-12-06 19:30:38 UTC
I've got the Pentium N3540 on my Asus laptop. I made a fresh install of the Ubuntu daily build (16.04) this afternoon, and it uses kernel 4.3.0-2. No freeze so far after one afternoon of light use. I listened to music with Rhythmbox and browsed the web.
Comment 125 John 2015-12-06 20:02:34 UTC
My apologies Mr. Frühberger, I see that I've once again re-discovered an already existing work around.  In the first post for this bug, you revealed the cstate workaround, almost a year ago.

I've tried your patch on my freeze prone 4.2.6.  It did last longer (25 minutes vs. 5).  The patch looks valid all the way back to 3.18, the oldest project directory I have.  I suspect on my 4.2.5 kernel, the patch would appear to be freeze-less.
Comment 126 Daniel Vetter 2015-12-08 09:51:42 UTC
(In reply to Chris Rainey from comment #119)
> Confirming that "intel_idle.max_cstate=1" has solved my complete freeze
> issues on Bay Trail running Linux 4.1.13(Slackware64-current(pre-4.2)
> formerly running Ubuntu 15.04/15.10 with stock kernels).
> 
> Thanx for all the hard-work and long-efforts to see this through!

Hm, sounds like after over a year of random walking multiple people have nailed this to cpu cstates, and the gpu driver changing behaviour slightly was just the canary in the coal mine here.

I tried to read through all comments here (gosh is there a lot of that) and didn't find anything to contradict that.

Given that I filed a new bug report on bugzilla.kernel.org:

https://bugzilla.kernel.org/show_bug.cgi?id=109051

Everyone please jump over there to that bug and fill in with your details/summary.

Thanks, Daniel
Comment 127 Mika Kuoppala 2015-12-17 15:15:44 UTC
Created attachment 120563
drm/i915/vlv: Take forcewake on media engine writes
Comment 128 Luka Karinja 2015-12-17 21:08:18 UTC
(In reply to Mika Kuoppala from comment #127)
> Created attachment 120563
> drm/i915/vlv: Take forcewake on media engine writes

What kernel version should be used? I tried applying it to 4.4rc5 and 4.3.3 and got build errors.
Comment 129 Mika Kuoppala 2015-12-18 13:04:30 UTC
Created attachment 120584
drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes
Comment 130 John 2015-12-18 20:06:59 UTC
(In reply to Mika Kuoppala from comment #129)
> Created attachment 120584
> drm/i915/vlv: [V4.3 backport] Take forcewake on media engine writes

Thanks for the backport.  Without cstate arg, I had a freeze within a few minutes.  With cstate arg and patch no problems.  The justification for the patch seems quite reasonable, it just doesn't affect freezing on my setup (ASUS T100-CHI Mint17.2/Cinnamon).  I'll try the patch with other kernels for Mint and Manjaro.
Comment 131 Veronica 2016-03-25 00:24:32 UTC
Hello, I've been having this same issue of full system hang/freeze in my Asus Chromebox (Haswell) since I got it.
I've tried multiple xbuntu distros, kodibuntu and OpenElec and in all of them I always had system freezes, mostly while watching videos in Kodi but also while in desktop or watching videos in browser (YouTube, Netflix).
Every time I've had to go back to Windows, no problem there; right now I'm booting Win 10 off an external HDD and GalliumOS (based on Ubuntu 15.04 with default kernel) from the internal SSD.

I too can't believe why this bug hasn't been fixed yet and honestly I don't understand what is the final fix/workaround for this bug.
Some people claim the cstate arg work but for others don't work.

Can someone please provide me a link to latest patched and working kernel version so I can test. I read all comments but its very confusing, there is no clear resolution here.

T.I.A
Comment 132 Veronica 2016-03-25 00:33:03 UTC
Freeze while watching video in YouTube, video freezes but audio is in a loop. Total system hang, force reboot necessary.

https://youtu.be/uSXXRf9t1E0
Comment 133 John 2016-03-25 01:25:41 UTC
(In reply to Veronica from comment #131)
> I too can't believe why this bug hasn't been fixed yet and honestly I don't
> understand what is the final fix/workaround for this bug.
> Some people claim the cstate arg work but for others don't work.
> 
> Can someone please provide me a link to latest patched and working kernel
> version so I can test. I read all comments but its very confusing, there is
> no clear resolution here.
> 
> T.I.A

The bug has been moved (but not fixed) to https://bugzilla.kernel.org/show_bug.cgi?id=109051  Over 200 additional comments, last 40 have some new ideas.

cstate works for a many, but not all.
Comment 134 Veronica 2016-03-25 04:32:03 UTC
(In reply to John from comment #133)
> (In reply to Veronica from comment #131)
> > I too can't believe why this bug hasn't been fixed yet and honestly I don't
> > understand what is the final fix/workaround for this bug.
> > Some people claim the cstate arg work but for others don't work.
> > 
> > Can someone please provide me a link to latest patched and working kernel
> > version so I can test. I read all comments but its very confusing, there is
> > no clear resolution here.
> > 
> > T.I.A
> 
> The bug has been moved (but not fixed) to
> https://bugzilla.kernel.org/show_bug.cgi?id=109051  Over 200 additional
> comments, last 40 have some new ideas.
> 
> cstate works for a many, but not all.

Thank you for that link, reading it and will report there after testing in my Chromebox.
Comment 135 Jani Nikula 2016-03-29 10:24:39 UTC
(In reply to Daniel Vetter from comment #126)
> (In reply to Chris Rainey from comment #119)
> > Confirming that "intel_idle.max_cstate=1" has solved my complete freeze
> > issues on Bay Trail running Linux 4.1.13(Slackware64-current(pre-4.2)
> > formerly running Ubuntu 15.04/15.10 with stock kernels).
> > 
> > Thanx for all the hard-work and long-efforts to see this through!
> 
> Hm, sounds like after over a year of random walking multiple people have
> nailed this to cpu cstates, and the gpu driver changing behaviour slightly
> was just the canary in the coal mine here.
> 
> I tried to read through all comments here (gosh is there a lot of that) and
> didn't find anything to contradict that.
> 
> Given that I filed a new bug report on bugzilla.kernel.org:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=109051
> 
> Everyone please jump over there to that bug and fill in with your
> details/summary.
> 
> Thanks, Daniel

RESOLVED MOVED again.
Comment 136 Jani Nikula 2016-09-20 12:19:59 UTC
*** Bug 93214 has been marked as a duplicate of this bug. ***

