Bug 83677 - [HSW gt1] GPU HANG: ecode 0:0x87d3bffa on ctx load
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: All All
Importance: highest blocker
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 78983 80229 85503 85765 86670 87045 87176 87571 88017 88044 88341 88604 88612 88839 89010 89025 89065 89089 89183 89531 89799 89964 90165 90509 90635 90659 90729 91024 91144 91932 91955 92647 92763 93756 95084
Depends on:
Blocks:
 
Reported: 2014-09-09 15:18 UTC by Simon Farnsworth
Modified: 2017-07-21 16:49 UTC
CC List: 51 users

See Also:
i915 platform: ALL
i915 features:


Attachments
Error state collected during hang (449.08 KB, application/gzip)
2014-09-09 15:19 UTC, Simon Farnsworth
Make the context switch+dispatch uninterruptible (968 bytes, patch)
2014-09-18 13:02 UTC, Chris Wilson
The error state after applying the patch from comment #15 (443.76 KB, application/octet-stream)
2014-09-18 15:15 UTC, Simon Farnsworth
Error state after patch from comment #24 is applied (444.94 KB, application/octet-stream)
2014-09-18 17:05 UTC, Simon Farnsworth
error state gzip with #requests (457.02 KB, application/octet-stream)
2014-09-22 11:09 UTC, Simon Farnsworth
Error state with invalidate after context switch (457.02 KB, application/octet-stream)
2014-10-10 14:53 UTC, Simon Farnsworth
Backport of requests and PPGTT changes to 3.17.0 (886.33 KB, patch)
2014-10-18 14:40 UTC, Simon Farnsworth
chrashlog on Google chromebox (1.55 MB, application/octet-stream)
2014-10-23 18:07 UTC, Peter Frühberger
dmesg output after suspend/resume (246.56 KB, text/plain)
2014-10-26 16:19 UTC, Hugh Greenberg
dump after error with i915.enable_ppgtt=0 (483.49 KB, application/octet-stream)
2014-11-05 18:05 UTC, Hugh Greenberg
possible hang fix (797 bytes, patch)
2014-11-06 05:55 UTC, Hugh Greenberg
gpu hang error with Greenberg patch (709.91 KB, application/octet-stream)
2014-11-06 10:30 UTC, Peter Frühberger
dmesg 3.17.2 + Greenberg patch (59.43 KB, text/plain)
2014-11-06 10:31 UTC, Peter Frühberger
GPU-hang dmesg output on Pentium G3420 using OpenELEC 4.2.1/XBMC (84.25 KB, text/plain)
2014-11-09 17:16 UTC, M. Kramer
1037U kernel BUG traceback in i915 code (1.99 MB, image/jpeg)
2014-11-12 10:50 UTC, Barry Scott
Force a CS stall inside gen7 invalidate-caches (1.52 KB, patch)
2014-11-14 12:14 UTC, Chris Wilson
Add extra flush flags for gen7 invalidate (1.06 KB, patch)
2014-12-10 20:56 UTC, Chris Wilson
Keep GPU awake for context switches (4.46 KB, patch)
2014-12-10 20:57 UTC, Chris Wilson

Description Simon Farnsworth 2014-09-09 15:18:08 UTC
I'm trying to persuade a Haswell G1820TE to run stable for long periods of time, but I keep getting GPU hangs.

[  414.334623] [drm] stuck on render ring
[  414.336340] [drm] GPU HANG: ecode 0:0x87d3bffa, in screen_manager [878], reason: Ring hung, action: reset
[  414.336351] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  414.336356] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  414.336360] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  414.336365] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  414.336370] [drm] GPU crash dump saved to /sys/class/drm/card0/error
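
When comparing this report against possible duplicates, it can help to pull the fields out of the "GPU HANG" line mechanically. The following is a minimal Python sketch, assuming the line keeps exactly the layout shown in the excerpt above; the function name `parse_hang` and the group name `prefix` are made up for illustration, and the number before the colon in the ecode is captured but not interpreted.

```python
import re

# Layout assumed from the single dmesg line quoted in this report.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<prefix>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)

def parse_hang(line):
    """Return the fields of a 'GPU HANG' dmesg line as a dict, or None if absent."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])  # the pid is the only numeric field converted
    return fields

line = ("[  414.336340] [drm] GPU HANG: ecode 0:0x87d3bffa, in screen_manager [878], "
        "reason: Ring hung, action: reset")
info = parse_hang(line)
print(info["ecode"], info["comm"], info["pid"], info["reason"], info["action"])
```

Lines that do not contain a hang summary simply yield `None`, so the function can be run over a whole dmesg capture.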

I've tried using i915.use_mmio_flip=1, but still get the hang.

What am I doing wrong?

Device is Intel(R) Celeron(R) CPU G1820TE @ 2.20GHz.

Packages:

 * kernel-3.17.0-0.rc4.git0.1.fc22.x86_64
 * libdrm-2.4.54-1.fc20.x86_64
 * mesa-dri-drivers-10.1.5-1.20140607.fc20.x86_64
 * xorg-x11-drv-intel-2.21.15-7.fc20.x86_64
 * xorg-x11-server-Xorg-1.14.4-5.fc20.x86_64

I can add more information if needed - I'll attach a gzip compressed version of the error state from when I had use_mmio_flip=1
Comment 1 Simon Farnsworth 2014-09-09 15:19:08 UTC
Created attachment 105992 [details]
Error state collected during hang
Comment 2 Simon Farnsworth 2014-09-11 17:13:38 UTC
I can repro this reliably, with only X11 and the compositor accessing the GPU; the application drawing (Adobe Flash in the repro case) is using X11 to draw.

Moving to xorg-x11-drv-intel-2.99.916-2.fc21 with SNA instead of UXA didn't help, nor did enabling triple buffering.
Comment 3 Chris Wilson 2014-09-11 20:12:27 UTC
The switch into the compositor's context is where it dies (and in all the similar bugs it is the switch into the GL context). One hypothesis is that something in the context state saved from GL is corrupt or plain invalid upon restore. An alternative shotgun: http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=requests
Comment 4 Simon Farnsworth 2014-09-17 10:00:39 UTC
(In reply to comment #3)
> The switch into the compositor's context is where it dies (and in all the
> similar bugs it is the switch into the GL context). One hypothesis is that
> something in the context state saved from GL is corrupt or plain invalid
> upon restore. An alternative shotgun:
> http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=requests

Applying the shotgun fixed it - at least one of the pellets must have hit the bug between the eyes.

How do you want me to proceed from here? I need a patchset that meets the rules for stable kernels (I'm trying to stick as closely as possible to Fedora's kernels here). I've got the usual tools to hand (repro case, git, RPM building tools etc), so can work with you to get a suitable tested patchset.
Comment 5 Chris Wilson 2014-09-17 10:24:30 UTC
That's a bit scary then. Which commit did I point you at so I can find the parent drm-intel-nightly commit (to narrow the shotgun down a bit)?
Comment 6 Simon Farnsworth 2014-09-17 11:32:47 UTC
I built and tested:

: sfarnsworth host64  $ git show
commit 3a5e1e6176fb61735a98f16a80c756b3cc69f125
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Aug 24 19:34:16 2014 +0100

    drm/i915: Convert a couple more INTEL_INFO-esque macros to be pointer agnostic
    
    Just a couple more macros that assume that they were being passed a
    struct drm_device when they want a struct drm_i915_private. Use our
    magic macro to ease transitioning over to using drm_i915_privates
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5cadfa5..d1678e2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2046,7 +2046,7 @@ struct drm_i915_cmd_table {
 #define HAS_VEBOX(dev)         (INTEL_INFO(dev)->ring_mask & VEBOX_RING)
 #define HAS_LLC(dev)           (INTEL_INFO(dev)->has_llc)
 #define HAS_WT(dev)            ((IS_HASWELL(dev) || IS_BROADWELL(dev)) && \
-                                to_i915(dev)->ellc_size)
+                                __I915__(dev)->ellc_size)
 #define I915_NEED_GFX_HWS(dev) (INTEL_INFO(dev)->need_gfx_hws)
 
 #define HAS_HW_CONTEXTS(dev)   (INTEL_INFO(dev)->gen >= 5)
@@ -2100,7 +2100,7 @@ struct drm_i915_cmd_table {
 #define INTEL_PCH_LPT_DEVICE_ID_TYPE           0x8c00
 #define INTEL_PCH_LPT_LP_DEVICE_ID_TYPE                0x9c00
 
-#define INTEL_PCH_TYPE(dev) (to_i915(dev)->pch_type)
+#define INTEL_PCH_TYPE(dev) (__I915__(dev)->pch_type)
 #define HAS_PCH_LPT(dev) (INTEL_PCH_TYPE(dev) == PCH_LPT)
 #define HAS_PCH_CPT(dev) (INTEL_PCH_TYPE(dev) == PCH_CPT)
 #define HAS_PCH_IBX(dev) (INTEL_PCH_TYPE(dev) == PCH_IBX)
Comment 7 Chris Wilson 2014-09-17 11:38:22 UTC
The first commit to check is then 257d90d13794c2eb545ab0d6c708f21e2a0378b6. That will tell us if the fix is in my shotgun branch or upstream. My guess is that it is in this branch, in which case you have two points from which to start bisecting. I have a few guesses, it might well be one of the minor patches...
Comment 8 Simon Farnsworth 2014-09-17 13:48:23 UTC
(In reply to comment #7)
> The first commit to check is then 257d90d13794c2eb545ab0d6c708f21e2a0378b6.
> That will tell us if the fix is in my shotgun branch or upstream. My guess
> is that it is in this branch, in which case you have two points from which
> to start bisecting. I have a few guesses, it might well be one of the minor
> patches...

That commit did not work - I get my GPU hangs.

I'll start bisecting.
Comment 9 Simon Farnsworth 2014-09-17 14:09:15 UTC
I think I done wrong. It looks like I tried your *master* branch, not your *requests* branch, and bisect won't work:

: sfarnsworth host64  $ git bisect start

: sfarnsworth host64  $ git bisect good 3a5e1e6176fb61735a98f16a80c756b3cc69f125

: sfarnsworth host64  $ git bisect bad 257d90d13794c2eb545ab0d6c708f21e2a0378b6
Some good revs are not ancestor of the bad rev.
git bisect cannot work properly in this case.
Maybe you mistake good and bad revs?
Comment 10 Chris Wilson 2014-09-17 15:31:26 UTC
(In reply to comment #9)
> I think I done wrong. It looks like I tried your *master* branch, not your
> *requests* branch, and bisect won't work:
> 
> : sfarnsworth host64  $ git bisect start
> 
> : sfarnsworth host64  $ git bisect good
> 3a5e1e6176fb61735a98f16a80c756b3cc69f125
> 
> : sfarnsworth host64  $ git bisect bad
> 257d90d13794c2eb545ab0d6c708f21e2a0378b6
> Some good revs are not ancestor of the bad rev.
> git bisect cannot work properly in this case.
> Maybe you mistake good and bad revs?

It's just that git is very ethical and doesn't share our loose definition of good and bad. In its opinion, old code is always good and bugs are only ever introduced. To get around this, you have to do a "reverse git bisect" and declare good as bad and vice versa.

i.e.

git bisect start
git bisect good 257d90d13794c2eb545ab0d6c708f21e2a0378b6
git bisect bad 3a5e1e6176fb61735a98f16a80c756b3cc69f125

then hang -> git bisect good, working -> git bisect bad.

I wish git bisect had a switch for that so that you didn't have to run the risk of mixing up good/bad on each step.
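
The reversed search described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how git implements bisection: it assumes a strictly linear history in which every commit before the fix hangs and every commit from the fix onward works, and the names `first_fixed`, `history` and `fix_index` are hypothetical.

```python
def first_fixed(commits, hangs):
    """Binary-search an ordered commit list for the first commit that no longer hangs.

    Assumes commits[0] hangs and commits[-1] works, with a single
    transition in between (the mirror image of a regression hunt).
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] hangs, commits[hi] works
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if hangs(commits[mid]):
            lo = mid  # corresponds to "git bisect good" in the reversed scheme
        else:
            hi = mid  # corresponds to "git bisect bad" in the reversed scheme
    return commits[hi]

history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6"]
fix_index = 4  # hypothetical: c4 is the commit where the hang goes away
result = first_fixed(history, lambda c: history.index(c) < fix_index)
print(result)
```

For reference, Git 2.7 and later accept custom labels through `git bisect start --term-old=<term> --term-new=<term>`, which avoids having to invert good and bad by hand on every step.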
Comment 11 Simon Farnsworth 2014-09-18 11:36:00 UTC
I used "git bisect good" for GPU hangs, "git bisect bad" for "it works", and "git bisect skip" for "compiler says no kernel for you."

Assuming no mistakes, I get:

: sfarnsworth host64  $ git bisect log
git bisect start
# good: [257d90d13794c2eb545ab0d6c708f21e2a0378b6] drm-intel-nightly: 2014y-08m-21d-10h-03m-09s integration manifest
git bisect good 257d90d13794c2eb545ab0d6c708f21e2a0378b6
# bad: [3a5e1e6176fb61735a98f16a80c756b3cc69f125] drm/i915: Convert a couple more INTEL_INFO-esque macros to be pointer agnostic
git bisect bad 3a5e1e6176fb61735a98f16a80c756b3cc69f125
# skip: [9abf49b9962e1fe5d30ac1cf32e8cc2272d531c4] intel-gtt: Report stolen_size as 0 when local memory is present
git bisect skip 9abf49b9962e1fe5d30ac1cf32e8cc2272d531c4
# skip: [30b824d88baa6b1a23e189c3c06ecf32e8cf0cbf] drm/i915: Reduce number of register access during IVB+ interrupt handling
git bisect skip 30b824d88baa6b1a23e189c3c06ecf32e8cf0cbf
# skip: [d33c3d9e218a8c96e6a15cc4b558b2b7780fe134] drm/i915: Check the minimum pitch for the user framebuffer
git bisect skip d33c3d9e218a8c96e6a15cc4b558b2b7780fe134
# skip: [dfd9d929b9a66d5ed9bfffc0335fc11293451290] drm/i915/sdvo: Fix LVDS connector status detection
git bisect skip dfd9d929b9a66d5ed9bfffc0335fc11293451290
# bad: [20ae302941850d0b3e00f6cbdc88d2824585f112] drm/i915: Improved w/a for rps on Baytrail
git bisect bad 20ae302941850d0b3e00f6cbdc88d2824585f112
# good: [555633d6527465a77845a9d705cd2075ccbdeef0] drm/i915: Remove DRI1 ring accessors and API
git bisect good 555633d6527465a77845a9d705cd2075ccbdeef0
# bad: [01094f706a41793d8708592e1925960370f83e05] drm/i915: Decouple the stuck pageflip on modeset
git bisect bad 01094f706a41793d8708592e1925960370f83e05
# bad: [03e2e353953fdd6627a0864be0e3c223762bd85c] drm/i915: Prevent recursive deadlock on releasing a busy userptr
git bisect bad 03e2e353953fdd6627a0864be0e3c223762bd85c
# skip: [6196a504b501a7e3ed6e740913243c2d2d070c21] drm/i915: Renames variables and functions that act upon intel_engine_cs
git bisect skip 6196a504b501a7e3ed6e740913243c2d2d070c21
# bad: [6fd4781d6c60795ab43180cdc081532054214fe7] drm/i915: s/seqno/request/ tracking inside objects
git bisect bad 6fd4781d6c60795ab43180cdc081532054214fe7
# only skipped commits left to test
# possible first bad commit: [6fd4781d6c60795ab43180cdc081532054214fe7] drm/i915: s/seqno/request/ tracking inside objects
# possible first bad commit: [6196a504b501a7e3ed6e740913243c2d2d070c21] drm/i915: Renames variables and functions that act upon intel_engine_cs
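
A log like the one above is easier to read once the verdicts are tallied and the leftover candidates are listed. Below is a small Python sketch that does this, assuming the `git bisect log` line formats seen in this comment; `summarize_bisect_log` is an illustrative name, and the sample is an abbreviated excerpt of the log, not the whole thing. Keep in mind that the labels are inverted in this bisect: "bad" means the build worked and "good" means it hung.

```python
import re
from collections import Counter

STEP_RE = re.compile(r"^git bisect (good|bad|skip) ([0-9a-f]{40})$")
CANDIDATE_RE = re.compile(r"^# possible first bad commit: \[([0-9a-f]{40})\] (.+)$")

def summarize_bisect_log(text):
    """Count good/bad/skip verdicts and collect the remaining candidate commits."""
    verdicts = Counter()
    candidates = []
    for line in text.splitlines():
        step = STEP_RE.match(line)
        if step:
            verdicts[step.group(1)] += 1
            continue
        cand = CANDIDATE_RE.match(line)
        if cand:
            # Keep an abbreviated hash plus the commit subject.
            candidates.append((cand.group(1)[:12], cand.group(2)))
    return verdicts, candidates

# Abbreviated excerpt of the log above (only three of its verdicts are kept).
log = """\
git bisect start
# good: [555633d6527465a77845a9d705cd2075ccbdeef0] drm/i915: Remove DRI1 ring accessors and API
git bisect good 555633d6527465a77845a9d705cd2075ccbdeef0
# skip: [6196a504b501a7e3ed6e740913243c2d2d070c21] drm/i915: Renames variables and functions that act upon intel_engine_cs
git bisect skip 6196a504b501a7e3ed6e740913243c2d2d070c21
# bad: [6fd4781d6c60795ab43180cdc081532054214fe7] drm/i915: s/seqno/request/ tracking inside objects
git bisect bad 6fd4781d6c60795ab43180cdc081532054214fe7
# only skipped commits left to test
# possible first bad commit: [6fd4781d6c60795ab43180cdc081532054214fe7] drm/i915: s/seqno/request/ tracking inside objects
# possible first bad commit: [6196a504b501a7e3ed6e740913243c2d2d070c21] drm/i915: Renames variables and functions that act upon intel_engine_cs
"""
verdicts, candidates = summarize_bisect_log(log)
print(dict(verdicts))
for short_hash, subject in candidates:
    print(short_hash, subject)
```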
Comment 12 Chris Wilson 2014-09-18 11:39:02 UTC
That indicates the shotgun helps. :| Oh well.
Comment 13 Chris Wilson 2014-09-18 12:01:38 UTC
I've updated the shotgun at #requests. It's been reworked quite a bit since then, and I need to double check that it still applies. I think I have a germ of a theory as to what is going wrong.
Comment 14 Simon Farnsworth 2014-09-18 12:55:57 UTC
(In reply to comment #13)
> I've updated the shotgun at #requests. It's been reworked quite a bit since
> then, and I need to double check that it still applies. I think I have a
> germ of a theory as to what is going wrong.

I'm now testing the requests branch, as of

commit da0c726483f60d4f53de49a4a2753a1d95983bd9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 18 12:54:55 2014 +0100

    Revert "drm/i915: Enable full PPGTT on gen7"
    
    This reverts commit 83255c23abe91da047dc71e52be62c42dd4c04a1.

This gets me a new bit of excitement - when I start X for the first time, the log file says:

[    72.039] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
[    72.039] (WW) xf86OpenConsole: setsid failed: Operation not permitted
[    72.039] (EE) 
Fatal server error:
[    72.039] (EE) xf86OpenConsole: VT_ACTIVATE failed: Input/output error
[    72.039] (EE) 
[    72.039] (EE) 

which it didn't before. Second attempt to start X works fine.

I also get a new message (but not reliably) in dmesg:

[  255.476493] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... blitter ring idle

And I've had the machine freeze completely while restarting X11.
Comment 15 Chris Wilson 2014-09-18 13:02:27 UTC
Created attachment 106501 [details] [review]
Make the context switch+dispatch uninterruptible

This should test my theory that it is a signal between setting the context and executing the batch that is causing the error. Slightly too coarse, but it should show whether I am heading in the right direction.

(Still would like confirmation on the current #requests shotgun :)
Comment 16 Chris Wilson 2014-09-18 13:11:28 UTC
(In reply to comment #14)
> (In reply to comment #13)
> > I've updated the shotgun at #requests. It's been reworked quite a bit since
> > then, and I need to double check that it still applies. I think I have a
> > germ of a theory as to what is going wrong.
> 
> I'm now testing the requests branch, as of
> 
> commit da0c726483f60d4f53de49a4a2753a1d95983bd9
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Sep 18 12:54:55 2014 +0100
> 
>     Revert "drm/i915: Enable full PPGTT on gen7"
>     
>     This reverts commit 83255c23abe91da047dc71e52be62c42dd4c04a1.
> 
> This gets me a new bit of excitement - when I start X for the first time,
> the log file says:
> 
> [    72.039] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
> [    72.039] (WW) xf86OpenConsole: setsid failed: Operation not permitted
> [    72.039] (EE) 
> Fatal server error:
> [    72.039] (EE) xf86OpenConsole: VT_ACTIVATE failed: Input/output error
> [    72.039] (EE) 
> [    72.039] (EE) 
> 
> which it didn't before. Second attempt to start X works fine.

Right, there is a nasty bug in the vt layer somewhere. The branch contains a patch to return -EIO to prevent a lockup. But you should see that regardless, just depends on kernel config.
 
> I also get a new message (but not reliably) in dmesg:
> 
> [  255.476493] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer
> elapsed... blitter ring idle
> 
> And I've had the machine freeze completely while restarting X11.

These two are more concerning. On HSW. Hmm.
Comment 17 Simon Farnsworth 2014-09-18 13:18:33 UTC
FWIW, those are both reproducible. I get the hangcheck message first, then later an X11 restart will take out the system (no local or remote access works).
Comment 18 Simon Farnsworth 2014-09-18 14:16:15 UTC
(In reply to comment #15)
> Created attachment 106501 [details] [review]
> Make the context switch+dispatch uninterruptible
> 
> This should test my theory that it is a signal between setting the context
> and executing the batch that is causing the error. Slightly too coarse, but
> it should show whether I am heading in the right direction.
> 
> (Still would like confirmation on the current #requests shotgun :)

A base 3.17-rc5 with this patch applied has GPU hangs. I've grabbed the error state if it would be interesting.
Comment 19 Chris Wilson 2014-09-18 15:13:40 UTC
(In reply to comment #18)
> (In reply to comment #15)
> > Created attachment 106501 [details] [review]
> > Make the context switch+dispatch uninterruptible
> A base 3.17-rc5 with this patch applied has GPU hangs. I've grabbed the
> error state if it would be interesting.

Please do, I expect it to be the same error, but we should check anyway.
Comment 20 Simon Farnsworth 2014-09-18 15:15:08 UTC
Created attachment 106508 [details]
The error state after applying the patch from comment #15
Comment 21 Chris Wilson 2014-09-18 15:49:33 UTC
(In reply to comment #20)
> Created attachment 106508 [details]
> The error state after applying the patch from comment #15

For the record, it is the same bug.
Comment 22 Chris Wilson 2014-09-18 15:53:24 UTC
If you have time, could you checkout c5cddc3c051057c11ea739744ed03d284ce0d0f3^ and see if that starts up ok? (Also if you have a netconsole for grabbing the oops from that lockup that would be very useful.)
Comment 23 Chris Wilson 2014-09-18 16:06:05 UTC
Maybe:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ad55b06a3cb1..9509f04c57b6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1351,10 +1351,8 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
        mutex_unlock(&dev->struct_mutex);
        ret = __wait_seqno(ring, seqno, reset_counter, true, NULL, file_priv);
        mutex_lock(&dev->struct_mutex);
-       if (ret)
-               return ret;
 
-       return i915_gem_object_wait_rendering__tail(obj, ring);
+       return 0;
 }
 
 /**
Comment 24 Chris Wilson 2014-09-18 16:06:29 UTC
Or rather:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ad55b06a3cb1..97089c392094 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1351,10 +1351,8 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
        mutex_unlock(&dev->struct_mutex);
        ret = __wait_seqno(ring, seqno, reset_counter, true, NULL, file_priv);
        mutex_lock(&dev->struct_mutex);
-       if (ret)
-               return ret;
 
-       return i915_gem_object_wait_rendering__tail(obj, ring);
+       return ret;
 }
 
 /**
Comment 25 Simon Farnsworth 2014-09-18 16:25:54 UTC
(In reply to comment #22)
> If you have time, could you checkout
> c5cddc3c051057c11ea739744ed03d284ce0d0f3^ and see if that starts up ok?
> (Also if you have a netconsole for grabbing the oops from that lockup that
> would be very useful.)

c5cddc3c051057c11ea739744ed03d284ce0d0f3^ starts up. netconsole gives me:

[  271.900969] ------------[ cut here ]------------
[  271.901001] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:130!
[  271.901026] invalid opcode: 0000 [#1] SMP 
[  271.901048] Modules linked in: netconsole dummy nf_conntrack_ipv4 ip6t_REJECT nf_defrag_ipv4 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack cfg80211 nf_conntrack ip6table_filter ip6_tables rfkill snd_dummy x86_pkg_temp_thermal coretemp snd_hda_codec_realtek kvm_intel snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec kvm snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer crct10dif_pclmul crc32_pclmul iTCO_wdt mei_me mei crc32c_intel iTCO_vendor_support snd ghash_clmulni_intel mxm_wmi tpm_tis lpc_ich tpm r8169 serio_raw pcspkr mii mfd_core i2c_i801 microcode soundcore wmi shpchp i915 i2c_algo_bit drm_kms_helper drm video
[  271.901506] CPU: 0 PID: 1602 Comm: screen_manager Not tainted 3.17.0-rc5+ #9
[  271.901533] Hardware name: ONELAN MS-7851/B85I (MS-7851), BIOS V3.5 05/30/2014
[  271.901560] task: ffff8800d43d4a00 ti: ffff8800a02f4000 task.ti: ffff8800a02f4000
[  271.901588] RIP: 0010:[<ffffffffa00a166d>]  [<ffffffffa00a166d>] i915_gem_object_retire__read+0x16d/0x170 [i915]
[  271.901650] RSP: 0018:ffff8800a02f7c78  EFLAGS: 00010246
[  271.901671] RAX: ffff8800d475e900 RBX: ffff8800362e18e0 RCX: dead000000200200
[  271.901698] RDX: 0000000000000140 RSI: ffff8800362e18e0 RDI: ffff8800d2fa26c0
[  271.901724] RBP: ffff8800a02f7ca0 R08: ffff8800a02594f8 R09: ffff88011da173c0
[  271.901750] R10: ffffea000351d780 R11: ffffffffa00b19d8 R12: ffff8800362e1a90
[  271.901777] R13: 0000000000000001 R14: ffff8800362e0000 R15: ffff8800d2fa26c0
[  271.901803] FS:  00007fe6bebc0700(0000) GS:ffff88011da00000(0000) knlGS:0000000000000000
[  271.901833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  271.901855] CR2: 00007fe6b8097138 CR3: 00000000ace05000 CR4: 00000000000407f0
[  271.901881] Stack:
[  271.901892]  ffff8800362e18e0 ffff8800362e1a90 0000000000000001 ffff8800362e0000                                                                                                                  
[  271.901926]  ffff8800d50e4078 ffff8800a02f7cc0 ffffffffa00a1ea8 ffff8800362e18e0
[  271.901960]  0000000000000005 ffff8800a02f7cf0 ffffffffa00a1f99 0000000000000000
[  271.901994] Call Trace:
[  271.902091]  [<ffffffffa00a1ea8>] i915_gem_retire_requests__engine+0x58/0x110 [i915]
[  271.902133]  [<ffffffffa00a1f99>] i915_gem_retire_requests+0x39/0x90 [i915]
[  271.902172]  [<ffffffffa00a209d>] i915_gem_object_retire+0xad/0x220 [i915]
[  271.902212]  [<ffffffffa00a2241>] i915_gem_object_wait_rendering.part.36+0x31/0x70 [i915]
[  271.902253]  [<ffffffffa00a3574>] i915_gem_object_set_to_cpu_domain+0x84/0x1d0 [i915]
[  271.902293]  [<ffffffffa00a39a5>] i915_gem_set_domain_ioctl+0x115/0x140 [i915]
[  271.902328]  [<ffffffffa00139ac>] drm_ioctl+0x1ec/0x660 [drm]
[  271.902354]  [<ffffffff8120aff0>] do_vfs_ioctl+0x2e0/0x4a0
[  271.902376]  [<ffffffff8120b231>] SyS_ioctl+0x81/0xa0
[  271.902399]  [<ffffffff81722129>] system_call_fastpath+0x16/0x1b
[  271.902422] Code: ff e8 88 0c f7 ff 5b 41 5c 41 5d 41 5e 41 5f 5d c3 4c 89 ff e8 35 fc ff ff e9 30 ff ff ff 4c 89 ff e8 68 fb ff ff e9 39 ff ff ff <0f> 0b 90 0f 1f 44 00 00 55 48 89 e5 53 48 8b 47 28 48 89 fb 48 
[  271.902941] RIP  [<ffffffffa00a166d>] i915_gem_object_retire__read+0x16d/0x170 [i915]
[  271.902985]  RSP <ffff8800a02f7c78>
[  271.939684] ---[ end trace 5bc289903bbf7885 ]---
[  271.939688] Kernel panic - not syncing: Fatal exception
[  271.939715] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  271.939754] drm_kms_helper: panic occurred, switching back to text console
Comment 26 Simon Farnsworth 2014-09-18 17:05:33 UTC
Created attachment 106519 [details]
Error state after patch from comment #24 is applied

I applied the patch from #24 to Linus's tree, and got GPU hangs (see attached error). Wrong tree?

I'm not going to be able to do more tonight - 2 year old is interested in what I'm doing.
Comment 27 Chris Wilson 2014-09-18 20:35:07 UTC
(In reply to comment #26)
> Created attachment 106519 [details]
> Error state after patch from comment #24 is applied
> 
> I applied the patch from #24 to Linus's tree, and got GPU hangs (see
> attached error). Wrong tree?

That's fine. It was just a stab in the dark.

As for the BUG() the assert looks valid, but I haven't seen how it could end up there. Oh well.
Comment 28 Chris Wilson 2014-09-18 20:49:47 UTC
Could you get drm.debug=7 dmesg for the BUG()? I don't think it will give anything else, but maybe it will have a nugget of gold in there. Best would be slub debug=y (use-after-free checks) or even kmemcheck.
Comment 29 Chris Wilson 2014-09-19 10:08:56 UTC
Still scratching my head over that BUG(). I've splattered a few more into #requests, if you could be so kind as to see if that changes the oops.

Meanwhile, current theory is that maybe it is the CS programming around the ctx switch that is the significant change in the shotgun. Still thinking.
Comment 30 Simon Farnsworth 2014-09-19 16:59:04 UTC
(In reply to comment #28)
> Could you get drm.debug=7 dmesg for the BUG()? I don't think it will give
> anything else, but maybe it will have a nugget of gold in there. Best would
> be slub debug=y (use-after-free checks) or even kmemcheck.

I've turned on slub debug, but drm.debug=7 dmesg flows too fast to send over netconsole, with lots of repeats of:

[  164.320029] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[  164.320055] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_BUSY
[  164.320056] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_MADVISE
[  164.320060] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_MADVISE
[  164.320062] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_PWRITE
[  164.320064] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_PWRITE
[  164.320066] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_BUSY
[  164.320067] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_MADVISE
[  164.320069] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_PWRITE
[  164.320071] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2

If I hit the BUG() again, I'll give you whatever I can get.
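
Since the drm.debug=7 output is dominated by repeated ioctl lines, one way to make a capture manageable is to collapse it into per-process ioctl counts. The Python sketch below does that for the excerpt quoted in this comment; it assumes the `[drm:drm_ioctl] pid=..., dev=..., auth=..., NAME` layout shown above, and `tally_ioctls` is an illustrative name rather than an existing tool.

```python
import re
from collections import Counter

# Layout assumed from the drm_ioctl debug lines quoted in this comment.
IOCTL_RE = re.compile(r"\[drm:drm_ioctl\] pid=(\d+), dev=0x[0-9a-f]+, auth=\d+, (\w+)")

def tally_ioctls(lines):
    """Count (pid, ioctl name) pairs, ignoring lines that are not drm_ioctl traces."""
    counts = Counter()
    for line in lines:
        match = IOCTL_RE.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts

sample = """\
[  164.320029] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[  164.320055] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_BUSY
[  164.320056] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_MADVISE
[  164.320060] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_MADVISE
[  164.320062] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_PWRITE
[  164.320064] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_PWRITE
[  164.320066] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_BUSY
[  164.320067] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_MADVISE
[  164.320069] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_PWRITE
[  164.320071] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
""".splitlines()

counts = tally_ioctls(sample)
for (pid, name), n in counts.most_common():
    print(pid, name, n)
```

The same function can be fed a saved dmesg file line by line, which keeps memory use flat even for long captures.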
Comment 31 Simon Farnsworth 2014-09-19 17:46:58 UTC
(In reply to comment #29)
> Still scratching my head over that BUG(). I've splattered a few more into
> #requests, if you could be so kind as to see if that changes the oops.
> 
> Meanwhile, current theory is that maybe it is the CS programming around the
> ctx switch that is the significant change in the shotgun. Still thinking.

I'm now testing your new #requests branch, as of

commit cdd8594d0f84e06f99cdd1e5b823b844c4249f6b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Sep 18 14:27:36 2014 +0100

    Revert "drm/i915: Enable full PPGTT on gen7"
    
    This reverts commit 83255c23abe91da047dc71e52be62c42dd4c04a1.

I'll let you know the results on Monday.
Comment 32 Simon Farnsworth 2014-09-22 09:25:48 UTC
(In reply to comment #31)
> (In reply to comment #29)
> > Still scratching my head over that BUG(). I've splattered a few more into
> > #requests, if you could be so kind as to see if that changes the oops.
> > 
> > Meanwhile, current theory is that maybe it is the CS programming around the
> > ctx switch that is the significant change in the shotgun. Still thinking.
> 
> I'm now testing your new #requests branch, as of
> 
> commit cdd8594d0f84e06f99cdd1e5b823b844c4249f6b
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Sep 18 14:27:36 2014 +0100
> 
>     Revert "drm/i915: Enable full PPGTT on gen7"
>     
>     This reverts commit 83255c23abe91da047dc71e52be62c42dd4c04a1.
> 
> I'll let you know the results on Monday.

This commit, running with i915.enable_rc6=0 i915.enable_fbc=0 slub_debug drm.debug=7, has not failed on me.

[229832.518684] [drm:drm_calc_vbltimestamp_from_scanoutpos] crtc 0 : v 7 p(0,-41)@ 230045.960557 -> 230045.961197 [e 0 us, 0 rep]
[229832.518741] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_RMFB
[229832.518744] [drm:__drm_framebuffer_unreference] ffff8800d48fc820: FB ID: 0 (2)
[229832.518748] [drm:drm_framebuffer_unreference] ffff8800d48fc820: FB ID: 0 (1)
[229832.518771] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.518861] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.518932] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.518998] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.519016] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, DRM_IOCTL_GEM_OPEN
[229832.519031] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_GET_TILING
[229832.519036] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_SET_DOMAIN
[229832.519083] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_SW_FINISH
[229832.519087] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.519214] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, DRM_IOCTL_GEM_CLOSE
[229832.519229] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.519233] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.519234] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.519236] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_SET_DOMAIN
[229832.519302] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_GETCRTC
[229832.519336] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_WAIT_VBLANK
[229832.519340] [drm:drm_wait_vblank] waiting on vblank count 13778948, crtc 0
[229832.519342] [drm:drm_wait_vblank] returning 13778948 to client
[229832.519347] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_GETCRTC
[229832.519367] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_ADDFB
[229832.519384] [drm:drm_framebuffer_reference] ffff8800d48fc820: FB ID: 56 (1)
[229832.519387] [drm:drm_mode_addfb] [FB:56]
[229832.519390] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_GETCRTC
[229832.519405] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_PAGE_FLIP
[229832.519408] [drm:drm_framebuffer_reference] ffff8800d48fc820: FB ID: 56 (2)
[229832.519446] [drm:drm_framebuffer_unreference] ffff8800d48fcc30: FB ID: 57 (3)
[229832.519498] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.519535] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.527171] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527176] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.527184] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.527192] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527194] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_DOMAIN
[229832.527348] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527352] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.527433] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[229832.527482] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527483] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527495] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527497] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.527504] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.527508] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527510] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_DOMAIN
[229832.527767] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527770] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527772] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527773] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527775] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527779] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.527853] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[229832.527873] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527875] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527877] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527885] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527887] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527898] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527900] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527902] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527904] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527906] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527909] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.527972] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[229832.528004] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.528006] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.528008] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.528012] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.528074] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.528161] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[229832.528218] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.528226] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.529757] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.529761] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.529770] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.529794] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.529798] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.529868] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[229832.529922] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.529943] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.529949] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.529997] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.530068] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.535331] [drm:drm_calc_vbltimestamp_from_scanoutpos] crtc 0 : v 7 p(0,-41)@ 230045.977219 -> 230045.977860 [e 0 us, 0 rep]

is a single frame's worth of dmesg output.

I'm going to remove drm.debug=7 and retest.
Comment 33 Simon Farnsworth 2014-09-22 09:29:18 UTC
(In reply to comment #31)
> (In reply to comment #29)
> > Still scratching my head over that BUG(). I've splattered a few more into
> > #requests, if you could be so kind as to see if that changes the oops.
> > 
> > Meanwhile, current theory is that maybe it is the CS programming around the
> > ctx switch that is the significant change in the shotgun. Still thinking.
> 
> I'm now testing your new #requests branch, as of
> 
> commit cdd8594d0f84e06f99cdd1e5b823b844c4249f6b
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Sep 18 14:27:36 2014 +0100
> 
>     Revert "drm/i915: Enable full PPGTT on gen7"
>     
>     This reverts commit 83255c23abe91da047dc71e52be62c42dd4c04a1.
> 
> I'll let you know the results on Monday.

Without drm.debug=7, this gives me the VT race that ends in X logging:

[    79.069] (++) using VT number 1

[    79.070] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
[    79.070] (WW) xf86OpenConsole: setsid failed: Operation not permitted
[    79.070] (EE) 
Fatal server error:
[    79.070] (EE) xf86OpenConsole: VT_ACTIVATE failed: Input/output error
Comment 34 Chris Wilson 2014-09-22 09:54:59 UTC
(In reply to comment #33) 
> Without drm.debug=7, this gives me the VT race that ends in X logging:
> 
> [    79.069] (++) using VT number 1
> 
> [    79.070] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
> [    79.070] (WW) xf86OpenConsole: setsid failed: Operation not permitted
> [    79.070] (EE) 
> Fatal server error:
> [    79.070] (EE) xf86OpenConsole: VT_ACTIVATE failed: Input/output error

Not my fault! It's a race entirely in the VT layer. It's been plaguing my machines for many months but I haven't deciphered enough of the VT code to understand what it is trying to do, let alone why it is failing.

If you try to start X again, it will work - it's purely a timing issue afaict.

Would be good to know if the machine runs stable without the drm.debug, and what happens without i915.enable_rc6=0.
Comment 35 Simon Farnsworth 2014-09-22 11:07:55 UTC
(In reply to comment #34)
> (In reply to comment #33) 
> > Without drm.debug=7, this gives me the VT race that ends in X logging:
> > 
> > [    79.069] (++) using VT number 1
> > 
> > [    79.070] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
> > [    79.070] (WW) xf86OpenConsole: setsid failed: Operation not permitted
> > [    79.070] (EE) 
> > Fatal server error:
> > [    79.070] (EE) xf86OpenConsole: VT_ACTIVATE failed: Input/output error
> 
> Not my fault! It's a race entirely in the VT layer. It's been plaguing my
> machines for many months but I haven't decyphered enough of the VT code to
> understand what it is trying to do, let alone why it is failing.
> 
> If you try to start X again, it will work - it's purely a timing issue
> afaict.
> 
> Would be good to know if the machine runs stable without the drm.debug, and
> what happens without i915.enable_rc6=0.

Shiny. I run without drm.debug, and I'm stable. I remove i915.enable_rc6=0, and the GPU hangs happen again.
Comment 36 Simon Farnsworth 2014-09-22 11:09:16 UTC
Created attachment 106669 [details]
error state gzip with #requests

Hang with #requests, kernel cmd line has i915.enable_fbc=0:

[ 1060.729567] [drm] GPU HANG: ecode 0:0x87d3bffa, in screen_manager [3142], reason: Ring hung, action: reset
[ 1060.729572] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1060.729574] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1060.729576] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1060.729578] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1060.729580] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1072.717358] [drm] stuck on render ring
[ 1072.719064] [drm] GPU HANG: ecode 0:0x87d3bffa, in screen_manager [3142], reason: Ring hung, action: reset
Comment 37 Chris Wilson 2014-09-22 11:31:25 UTC
(In reply to comment #35)
> Shiny. I run without drm.debug, and I'm stable. I remove i915.enable_rc6=0,
> and the GPU hangs happen again.

(In reply to comment #36)
> Hang with #requests, kernel cmd line has i915.enable_fbc=0:

Just for the sake of my sanity, can you confirm the command line settings used that resulted in the hang?
Comment 38 Chris Wilson 2014-09-22 11:33:38 UTC
For the record that last hang wasn't with my requests branch. (I could be in for a beating.)
Comment 39 Simon Farnsworth 2014-09-22 12:06:34 UTC
(In reply to comment #37)
> (In reply to comment #35)
> > Shiny. I run without drm.debug, and I'm stable. I remove i915.enable_rc6=0,
> > and the GPU hangs happen again.
> 
> (In reply to comment #36)
> > Hang with #requests, kernel cmd line has i915.enable_fbc=0:
> 
> Just for the sake of my sanity, can you confirm the command line settings
> used that resulted in the hang?

# cat /proc/cmdline 
BOOT_IMAGE=/bzImage root=/dev/mapper/NTBgroup-System20 ro log_buf_len=16M rd.md=0 rd.dm=0 LANG=en_GB.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=uk rd.luks=0 rd.lvm.lv=NTBgroup/System20 rd.lvm.lv=NTBgroup/Swap swapaccount=1 systemd.unit=signage.target net.ifnames=0 consoleblank=0 i915.enable_fbc=0 rhgb quiet
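
When comparing runs, it can help to check mechanically which i915 options a boot actually carried. The following is only a minimal sketch, assuming the command line has already been read into a string (for example from /proc/cmdline); `cmdline_has_param` is a hypothetical helper written for illustration, not part of any existing tool.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if the whitespace-separated kernel
 * command line contains exactly the token `param`,
 * e.g. "i915.enable_rc6=0". Prefix matches are rejected.
 */
bool cmdline_has_param(const char *cmdline, const char *param)
{
    size_t plen;
    const char *p;

    assert(cmdline != NULL && param != NULL);

    plen = strlen(param);
    if (plen == 0)
        return false;

    p = cmdline;
    while ((p = strstr(p, param)) != NULL) {
        /* The match must begin at the start of a token ... */
        bool starts_token = (p == cmdline) || (p[-1] == ' ');
        /* ... and end at the end of one. */
        char end = p[plen];
        bool ends_token = (end == '\0' || end == ' ' || end == '\n');

        if (starts_token && ends_token)
            return true;
        p += 1;
    }
    return false;
}
```

Applied to the command line above, such a check would find i915.enable_fbc=0 and would not find i915.enable_rc6=0.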
Comment 40 Simon Farnsworth 2014-09-22 13:12:15 UTC
(In reply to comment #38)
> For the record that last hang wasn't with my requests branch. (I could be in
> for a beating.)

Ooops, sorry, yes. That's Linus's tree with the patch from comment #24 applied.

I'm now testing #requests, to see if I can knock that over. So far, the only thing I've had is:
[ 1224.506836] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... blitter ring idle

and I'm not seeing any consequences from that.
Comment 41 Simon Farnsworth 2014-09-22 16:59:26 UTC
(In reply to comment #38)
> For the record that last hang wasn't with my requests branch. (I could be in
> for a beating.)

After an afternoon of beating on the device under test, I have no failures from the requests branch.

[ 1224.506836] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... blitter ring idle
[ 8303.940102] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... render ring idle

are the only things logged by the kernel, and the kernel was able to recover from both of them.
Comment 42 Chris Wilson 2014-09-22 19:36:42 UTC
Thanks, the missed irq/seqno coherency is worrying enough, but at least it confirms that there is some magic in there that seems to prevent the ctx load error. Do you mind keeping that test system running until a bug shows itself?
Comment 43 Simon Farnsworth 2014-09-23 07:57:24 UTC
(In reply to comment #42)
> Thanks, the missed irq/seqno coherency is worrying enough, but at least it
> confirms that there is some magic in there that seems to prevent the ctx
> load error. Do you mind keeping that test system running until a bug shows
> itself?

I can keep the test system running indefinitely, as long as I can get a stable-suitable patch ASAP to stop the GPU hang.
Comment 44 Simon Farnsworth 2014-09-25 17:08:08 UTC
(In reply to comment #42)
> Thanks, the missed irq/seqno coherency is worrying enough, but at least it
> confirms that there is some magic in there that seems to prevent the ctx
> load error. Do you mind keeping that test system running until a bug shows
> itself?

Still no further issues.

It looks like I can only provoke that message by restarting X and the compositor; would you like me to set that going in an endless loop and see if it BUG()s?
Comment 45 Chris Wilson 2014-09-25 20:14:19 UTC
(In reply to comment #44)
> (In reply to comment #42)
> > Thanks, the missed irq/seqno coherency is worrying enough, but at least it
> > confirms that there is some magic in there that seems to prevent the ctx
> > load error. Do you mind keeping that test system running until a bug shows
> > itself?
> 
> Still no further issues.
> 
> It looks like I can only provoke that message by restarting X and the
> compositor; would you like me to set that going in an endless loop and see
> if it BUG()s?

Nah, worked out the cause there. It is the ivb+ blt irq coherency bug, and a bad interaction of patches in my branch broke the w/a.

I've been trying to think what other magic in s/seqno/requests/ could be fixing up the ctx hang. I think we have more or less explored the ctx specific parts of the patch. So now what? :|
Comment 46 Chris Wilson 2014-09-28 07:03:51 UTC
(Just marking this bug for special interest, since we have a patch that seems to work, just not yet the right patch.)
Comment 47 Chris Wilson 2014-10-10 12:20:12 UTC
Small, but it forces the invalidate after the ctx load:

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 1a0611bb576b..9676bc729f13 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1082,11 +1082,11 @@ i915_gem_ringbuffer_submission(struct drm_device *dev, struct drm_file *file,
                }
        }
 
-       ret = i915_gem_execbuffer_move_to_gpu(ring, vmas);
+       ret = i915_switch_context(ring, ctx);
        if (ret)
                goto error;
 
-       ret = i915_switch_context(ring, ctx);
+       ret = i915_gem_execbuffer_move_to_gpu(ring, vmas);
        if (ret)
                goto error;
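
The effect of swapping the two calls can be pictured with a toy model. This is only a sketch under the assumption, implied by the description above, that the move-to-GPU step is where the invalidate is emitted; every `toy_*` name is a stand-in invented for illustration and none of this is the i915 code.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: stand-ins for commands written to the ring. */
enum ring_cmd { CMD_CTX_LOAD, CMD_INVALIDATE, CMD_BATCH };

struct toy_ring {
    enum ring_cmd cmds[8];
    size_t count;
};

void toy_emit(struct toy_ring *ring, enum ring_cmd cmd)
{
    assert(ring->count < sizeof(ring->cmds) / sizeof(ring->cmds[0]));
    ring->cmds[ring->count++] = cmd;
}

/* Stand-in for i915_switch_context(): emits the context load. */
int toy_switch_context(struct toy_ring *ring)
{
    toy_emit(ring, CMD_CTX_LOAD);
    return 0;
}

/* Stand-in for i915_gem_execbuffer_move_to_gpu(): assumed to emit the invalidate. */
int toy_move_to_gpu(struct toy_ring *ring)
{
    toy_emit(ring, CMD_INVALIDATE);
    return 0;
}

/* Patched order: context switch first, so the invalidate lands after the ctx load. */
int toy_submit_patched(struct toy_ring *ring)
{
    int ret;

    ret = toy_switch_context(ring);
    if (ret)
        return ret;

    ret = toy_move_to_gpu(ring);
    if (ret)
        return ret;

    toy_emit(ring, CMD_BATCH);
    return 0;
}
```

In this model the ring ends up holding the context load, then the invalidate, then the batch; with the original call order the first two would be reversed.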
Comment 48 Simon Farnsworth 2014-10-10 14:52:59 UTC
(In reply to Chris Wilson from comment #47)
> Small, but it forces the invalidate after the ctx load:
> 
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 1a0611bb576b..9676bc729f13 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -1082,11 +1082,11 @@ i915_gem_ringbuffer_submission(struct drm_device
> *dev, struct drm_file *file,
>                 }
>         }
>  
> -       ret = i915_gem_execbuffer_move_to_gpu(ring, vmas);
> +       ret = i915_switch_context(ring, ctx);
>         if (ret)
>                 goto error;
>  
> -       ret = i915_switch_context(ring, ctx);
> +       ret = i915_gem_execbuffer_move_to_gpu(ring, vmas);
>         if (ret)
>                 goto error;

Applied the moral equivalent of that change to 3.16.3, and I see failure. I'll attach the error state.
Comment 49 Simon Farnsworth 2014-10-10 14:53:35 UTC
Created attachment 107665 [details]
Error state with invalidate after context switch
Comment 50 Chris Wilson 2014-10-10 15:04:13 UTC
(In reply to Simon Farnsworth from comment #49)
> Created attachment 107665 [details]
> Error state with invalidate after context switch

Looks like the same error. However, there is also massive corruption of the render ring. Either that or the error state capture is snafu. Would you be happy with a backport of the mammoth patch if it proved to be stable?
Comment 51 Simon Farnsworth 2014-10-10 15:12:49 UTC
(In reply to Chris Wilson from comment #50)
> (In reply to Simon Farnsworth from comment #49)
> > Created attachment 107665 [details]
> > Error state with invalidate after context switch
> 
> Looks like same error.  However, there is also massive corruption of the
> render ring. Either that or the error state capture is snafu. Would you be
> happy with a backport of mammoth patch if it proved to be stable?

A 3,000 patch, 80 MB patchset would be fine if it were stable on HSW and IVB.
Comment 52 Rodrigo Vivi 2014-10-15 21:34:01 UTC
*** Bug 80229 has been marked as a duplicate of this bug. ***
Comment 53 Simon Farnsworth 2014-10-18 14:40:43 UTC
Created attachment 108029 [details] [review]
Backport of requests and PPGTT changes to 3.17.0

I've backported the changes from #requests to apply against the kernel RPM from http://koji.fedoraproject.org/koji/buildinfo?buildID=583526

This is a fairly intrusive backport - I've tried to take drivers/gpu/drm/i915 wholesale, then remove execlists/logical ring contexts rather than piecemeal bring things forwards.

I'd appreciate it if someone could look over what I've done, and tell me if it makes sense.
Comment 54 Rainer Hochecker 2014-10-22 15:43:03 UTC
I tested Simon's patch applied to a Ubuntu 3.17.1 kernel on a 1820T system. The GPU hang did not show during a three hours test. In general it shows within the first 20 min after system start. Will do more tests.
Comment 55 Rainer Hochecker 2014-10-22 19:25:57 UTC
This patch seems to introduce new problems for me. I place fences into the render pipeline, and after a glFlush not all of them reach the signaled state. This procedure worked without this patch and with other drivers such as NVIDIA and AMD, so the issue was most likely introduced by this huge patch.
Comment 56 Chris Wilson 2014-10-22 19:35:44 UTC
(In reply to Rainer Hochecker from comment #55)
> This patch seems to introduce new problems for me. I place fences into the
> render pipeline and after a glFlush I don't get all of them into the
> signaled state. This procedure worked without this patch and with other
> driver like NVidia and AMD so the issue most likely got introduced with this
> huge patch.


diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6bf2dcf67bf2..158abb4c322a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2544,13 +2544,16 @@ i915_gem_idle_work_handler(struct work_struct *work)
 static int
 i915_gem_object_flush_active(struct drm_i915_gem_object *obj)
 {
-       int ret;
+       int ret, n;
 
        if (!obj->active)
                return 0;
 
-       if (obj->last_write.request) {
-               ret = i915_request_emit_breadcrumb(obj->last_write.request);
+       for (n = 0; n < I915_NUM_ENGINES; n++) {
+               if (obj->last_read[n].request == NULL)
+                       continue;
+
+               ret = i915_request_emit_breadcrumb(obj->last_read[n].request);
                if (ret)
                        return ret;


which I hope is overkill. If you have a snippet demonstrating the fences I can trace through mesa and see if there is a more precise flush we can do.
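
The idea behind the diff above, emitting a breadcrumb for the last read on every engine instead of only for the last write, can be modelled outside the kernel. This is a minimal sketch, not the driver code: the `toy_*` types and functions and the value of `TOY_NUM_ENGINES` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NUM_ENGINES 4 /* assumption; stands in for I915_NUM_ENGINES */

struct toy_request {
    int breadcrumb_emitted;
};

struct toy_object {
    int active;
    struct toy_request *last_read[TOY_NUM_ENGINES];
};

/* Stand-in for i915_request_emit_breadcrumb(): just records the call. */
int toy_emit_breadcrumb(struct toy_request *rq)
{
    assert(rq != NULL);
    rq->breadcrumb_emitted = 1;
    return 0;
}

/* Walk every engine and flush each outstanding read, skipping idle slots. */
int toy_flush_active(struct toy_object *obj)
{
    int ret, n;

    if (!obj->active)
        return 0;

    for (n = 0; n < TOY_NUM_ENGINES; n++) {
        if (obj->last_read[n] == NULL)
            continue;

        ret = toy_emit_breadcrumb(obj->last_read[n]);
        if (ret)
            return ret;
    }

    return 0;
}
```

An object that is only being read on some engine would be skipped by a check on the last write alone, which is the case the per-engine loop covers.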
Comment 57 Rainer Hochecker 2014-10-22 20:12:18 UTC
Not sure if this is related: currently vaapi lacks a good interop method with gl, so we use vaPutSurface and texture-from-pixmap. The entire render pipeline is: decode with vaapi, vaapi postprocessing (deinterlacing), vaPutSurface (render vaapi video surface into pixmap), map pixmap to texture, render texture, place fence.
When the fence signals, we know that the video surface (and some other resources) are ready for reuse.

this places the fence:
https://github.com/FernetMenta/xbmc/blob/master/xbmc/cores/dvdplayer/DVDCodecs/Video/VAAPI.cpp#L1202

this function checks for signaled fences:
https://github.com/FernetMenta/xbmc/blob/master/xbmc/cores/dvdplayer/DVDCodecs/Video/VAAPI.cpp#L2022

I have had video playing for 4 hours today without any issues, but as soon as I stop playback it waits for all fences to be signaled, that is, for COutput::ProcessSyncPicture to return false, and this does not happen. There is at least one fence not in GL_SIGNALED state.
When playback is stopped there is a glFlush before COutput::ProcessSyncPicture:
https://github.com/FernetMenta/xbmc/blob/master/xbmc/cores/dvdplayer/DVDCodecs/Video/VAAPI.cpp#L1709

It never comes out of the while loop at the next line.
Comment 58 Rainer Hochecker 2014-10-22 20:47:10 UTC
Peter has built a new kernel with the last patch. I will test this tomorrow and report back.
Comment 59 Rainer Hochecker 2014-10-23 06:14:34 UTC
The patch in comment 56 fixes the issue.
Comment 60 Peter Frühberger 2014-10-23 18:07:29 UTC
Created attachment 108310 [details]
chrashlog on Google chromebox

The attached zip contains dmesg and error, captured after the problem recurred with the latest Chris Wilson patches on a 3.17.1 kernel.

Hardware: Celeron 2955U.
Comment 61 Peter Frühberger 2014-10-23 18:08:33 UTC
Sadly, one of our users still gets frequent crashes with the latest code linked here, which is the extracted patch from Simon plus the additional fix Chris Wilson made for the fence problem.

Perhaps the dmesg helps a bit in this case, because it looks much more detailed than before.
Comment 62 Hugh Greenberg 2014-10-26 16:19:29 UTC
Created attachment 108459 [details]
dmesg output after suspend/resume

I tried Simon's patch for kernel 3.17.0 and Chris Wilson's latest patch and I have not experienced the hangs yet.  However, I noticed that resume from suspend no longer works.  The screen flickers and then remains black.  Attached is the dmesg output.
Comment 63 Ferry Toth 2014-10-27 07:35:01 UTC
I have this exact bug too.

My hardware is an (ex-)Chromebook Acer C720P with Haswell-ULT graphics. I'm running linux-3.17 with xorg 1.16 and intel 2.99.916.

The desktop has kwin, which autodetects the hang and resumes as if nothing happened, except for a pause of 5 seconds or so. So no X restarts or other bad crashes.

I've noticed that the hang happens occasionally when running firefox (1x per day), but often when running Chromium (1x per hour?).

How can I help?
Comment 64 Hugh Greenberg 2014-10-30 05:15:27 UTC
I just noticed that if I use this kernel option: i915.enable_ppgtt=0, I get a different hang:

8.562995] [drm] stuck on render ring
[  498.564286] [drm] GPU HANG: ecode 0:0x85dffffd, in Xorg [1161], reason: Ring hung, action: reset
[  498.564289] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  498.564290] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  498.564291] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  498.564293] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  498.564294] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  507.545019] [drm] stuck on render ring
[  507.546315] [drm] GPU HANG: ecode 0:0x85dffffd, in Xorg [1161], reason: Ring hung, action: reset
[  507.546921] [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!

It seems like this hang has been fixed here: https://www.libreoffice.org/bugzilla/show_bug.cgi?id=78533 , but when I compare the patch in that post with the kernel 3.17.1, there are things missing.  For example, I see that this doesn't fully match up:

--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2139,13 +2139,16 @@ void i915_init_vm(struct drm_i915_private *dev_priv,
 void i915_gem_free_object(struct drm_gem_object *obj);
 void i915_gem_vma_destroy(struct i915_vma *vma);

-#define PIN_MAPPABLE 0x1
-#define PIN_NONBLOCK 0x2
-#define PIN_GLOBAL 0x4
+#define PIN_OFFSET_FIXED 0x1
+#define PIN_OFFSET_BIAS 0x2
+#define PIN_MAPPABLE 0x4
+#define PIN_NONBLOCK 0x8
+#define PIN_GLOBAL 0x10
+#define PIN_OFFSET_MASK (~4095)


Is this something worth investigating, or am I wasting my time?
Comment 65 Hugh Greenberg 2014-11-02 05:42:49 UTC
I can confirm that the patch posted by Simon with i915.enable_rc6=0 does not fix the issue. I also looked into my previous question and it didn't help.
Comment 66 Chris Wilson 2014-11-02 19:10:58 UTC
*** Bug 85765 has been marked as a duplicate of this bug. ***
Comment 67 Hugh Greenberg 2014-11-05 17:46:29 UTC
This comment was the result of a test with kernel 3.17.1 plus the patch submitted by Simon here, and only that so far.

(In reply to Hugh Greenberg from comment #64)
> I just noticed that if I use this kernel option: i915.enable_ppgtt=0, I get
> a different hang:
> 
> 8.562995] [drm] stuck on render ring
> [  498.564286] [drm] GPU HANG: ecode 0:0x85dffffd, in Xorg [1161], reason:
> Ring hung, action: reset
> [  498.564289] [drm] GPU hangs can indicate a bug anywhere in the entire gfx
> stack, including userspace.
> [  498.564290] [drm] Please file a _new_ bug report on bugs.freedesktop.org
> against DRI -> DRM/Intel
> [  498.564291] [drm] drm/i915 developers can then reassign to the right
> component if it's not a kernel issue.
> [  498.564293] [drm] The gpu crash dump is required to analyze gpu hangs, so
> please always attach it.
> [  498.564294] [drm] GPU crash dump saved to /sys/class/drm/card0/error
> [  507.545019] [drm] stuck on render ring
> [  507.546315] [drm] GPU HANG: ecode 0:0x85dffffd, in Xorg [1161], reason:
> Ring hung, action: reset
> [  507.546921] [drm:i915_context_is_banned] *ERROR* gpu hanging too fast,
> banning!
> 
> It seems like this hang has been fixed here:
> https://www.libreoffice.org/bugzilla/show_bug.cgi?id=78533 , but when I
> compare the patch in that post with the kernel 3.17.1, there are things
> missing.  For example, I see that this doesn't fully match up:
> 
> --- a/drivers/gpu/drm/i915/i915_drv.h
> +++ b/drivers/gpu/drm/i915/i915_drv.h
> @@ -2139,13 +2139,16 @@ void i915_init_vm(struct drm_i915_private *dev_priv,
>  void i915_gem_free_object(struct drm_gem_object *obj);
>  void i915_gem_vma_destroy(struct i915_vma *vma);
> 
> -#define PIN_MAPPABLE 0x1
> -#define PIN_NONBLOCK 0x2
> -#define PIN_GLOBAL 0x4
> +#define PIN_OFFSET_FIXED 0x1
> +#define PIN_OFFSET_BIAS 0x2
> +#define PIN_MAPPABLE 0x4
> +#define PIN_NONBLOCK 0x8
> +#define PIN_GLOBAL 0x10
> +#define PIN_OFFSET_MASK (~4095)
> 
> 
> Is this something worth investigating, or am I wasting my time?
Comment 68 Hugh Greenberg 2014-11-05 17:47:42 UTC
Chris Wilson gave me the following kernel parameter to try:

i915.enable_ppgtt=0

on a stock kernel (not including the giant patch referenced here) and I have not been able to reproduce the hangs with it. I've tested with kernel 3.17.1 and 3.17.2.  Please confirm or deny that this works for you.  Thanks.
Comment 69 Rainer Hochecker 2014-11-05 17:55:43 UTC
(In reply to Hugh Greenberg from comment #68)
> Chris Wilson gave me the following kernel parameter to try:
> 
> i915.enable_ppgtt=0
> 
> on a stock kernel (not including the giant patch referenced here) and I have
> not been able to reproduce the hangs with it. I've tested with kernel 3.17.1
> and 3.17.2.  Please confirm or deny that this works for you.  Thanks.

I tried this 3 weeks ago and it did not help:
https://bugs.freedesktop.org/show_bug.cgi?id=80229#c62
Comment 70 Hugh Greenberg 2014-11-05 18:02:58 UTC
(In reply to Rainer Hochecker from comment #69)
> (In reply to Hugh Greenberg from comment #68)
> > Chris Wilson gave me the following kernel parameter to try:
> > 
> > i915.enable_ppgtt=0
> > 
> > on a stock kernel (not including the giant patch referenced here) and I have
> > not been able to reproduce the hangs with it. I've tested with kernel 3.17.1
> > and 3.17.2.  Please confirm or deny that this works for you.  Thanks.
> 
> I tried this 3 weeks ago and did not help:
> https://bugs.freedesktop.org/show_bug.cgi?id=80229#c62

Yes, you are right.  I just experienced the hang again.  Sorry.
Comment 71 Hugh Greenberg 2014-11-05 18:05:06 UTC
Created attachment 108976 [details]
dump after error with i915.enable_ppgtt=0

This is the error dump I got after booting the kernel with: i915.enable_ppgtt=0 .
Comment 72 Peter Frühberger 2014-11-05 18:09:18 UTC
Am I correct in summarizing that Celeron and Pentium HSW GPUs in particular are affected?

Our testing on the xbmc forums shows the same results. It seems only the low-end HSW GPUs frequently run into this hang.

Perhaps someone could check whether the "ringbuffer" is one bit off, or whether something goes wrong on rounding, clamping, or fragmentation? Or is some flush missing, as happened in Mesa in the past?

It may well be that higher-end GPU series have this issue too, but are hit less frequently because they have more Execution Units. Perhaps more load can stall them too?
Comment 73 Ferry Toth 2014-11-05 20:56:39 UTC
I have Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 09)
(Intel Celeron 2955U)

I guess that matches the profile?

This is on a Acer C720P Chromebook booting using the built-in Seabios (developer mode).

Strange thing: I notice hangs often when running Chromium, but never when booting ChromeOS (which runs Chrome). So either Google fixed something in their kernel, or they have the GPU configured differently when coreboot boots ChromeOS than when coreboot boots Seabios. Or do kernels more recent than ChromeOS's introduce the problem?
Comment 74 Peter Frühberger 2014-11-05 20:58:02 UTC
Can you try the chrome kernel version on your linux installation? Ubuntu mainline has some of those (if you are using Ubuntu). That would be a good start for bisecting.
Comment 75 Ferry Toth 2014-11-05 21:44:44 UTC
They have? I didn't know that. Which kernel would you like me to try?

Up to now I have had 3.13 from 14.04 with some patched-up drivers to get the touchpad and touchscreen working, and 3.17rc?, 3.17 and 3.17.1 from the kernel ppa (mainline kernels). 3.16 from 14.10 does not have the drivers included and no patched drivers are available afaik, so it doesn't work too well. Also tried 3.18rc? but that seem to be in the best shape right now (rc1 didn't even boot). 3.13 + patches and 3.17 seem to run equally well but with the exact same GPU hang.

Also xorg-edgers ppa makes no change.
Comment 76 Hugh Greenberg 2014-11-06 05:55:40 UTC
Created attachment 109009 [details] [review]
possible hang fix

This is a small patch to change a register definition that I think is wrong.  It is against 3.17.2, but it should work for at least any 3.17 kernel.  Please let me know if it fixes the issue or not.
Comment 77 Hugh Greenberg 2014-11-06 05:57:35 UTC
(In reply to Hugh Greenberg from comment #76)
> Created attachment 109009 [details] [review] [review]
> possible hang fix
> 
> This is a small patch to change a register definition that I think is wrong.
> It is against 3.17.2, but it should work for at least any 3.17 kernel. 
> Please let me know if it fixes the issue or not.

No special boot parameters are needed.
Comment 78 Peter Frühberger 2014-11-06 08:05:59 UTC
For easy testing I build Ubuntu kernel packages based on my gpuhang branch at: https://github.com/fritsch/linux/tree/gpuhang

This is stable 3.17.2 with the patch Hugh Greenberg provided:

https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.17.2-fix-gpu-hang%2B_3.17.2-fix-gpu-hang%2B-10.00.Custom_amd64.deb
https://dl.dropboxusercontent.com/u/55728161/linux-image-3.17.2-fix-gpu-hang%2B_3.17.2-fix-gpu-hang%2B-10.00.Custom_amd64.deb

Happy testing to those that run the affected hardware.
Comment 79 Peter Frühberger 2014-11-06 10:30:05 UTC
Created attachment 109018 [details]
gpu hang error with Greenberg patch

Kernel hang with the patch provided by Hugh Greenberg
Comment 80 Peter Frühberger 2014-11-06 10:31:08 UTC
Created attachment 109019 [details]
dmesg 3.17.2 + Greenberg patch
Comment 81 Peter Frühberger 2014-11-06 11:16:30 UTC
Nevertheless, I think you are onto something.

[  869.806084] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  869.806084] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  869.806085] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  869.806086] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  974.865662] [drm] stuck on render ring
[  974.869560] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 1248.988099] [drm] stuck on render ring
[ 1248.992108] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 1354.039617] [drm] stuck on render ring
[ 1354.043649] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 1472.097518] [drm] stuck on render ring
[ 1472.101540] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 2204.468693] [drm] stuck on render ring
[ 2204.472663] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 2430.567575] [drm] stuck on render ring
[ 2430.571278] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 3101.896809] [drm] stuck on render ring
[ 3101.900614] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 5090.884258] [drm] stuck on render ring
[ 5090.888231] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 5402.024853] [drm] stuck on render ring
[ 5402.028782] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset

It seems (no scientific proof) that the hang occurs even more frequently with the patch applied. Every GPU HANG line in the log above freezes rendering for a certain amount of time, which has a huge visual impact.
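To put a number on "more frequently", the gaps between hangs can be computed from the kernel timestamps in dmesg. What follows is only a rough sketch: the regular expression is written against the log lines pasted above, and the function names are made up for illustration. It would have to be run on logs taken with and without the patch to allow any comparison.

```python
import re

# Written against lines such as:
# [  974.869560] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
HANG_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+\[drm\]\s+GPU HANG: ecode (\S+),")


def hang_timestamps(dmesg_text):
    """Return the kernel timestamp (seconds since boot) of every GPU HANG line."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.match(line.strip())
        if match:
            stamps.append(float(match.group(1)))
    return stamps


def hang_intervals(stamps):
    """Return the time in seconds between consecutive hangs."""
    return [round(later - earlier, 3) for earlier, later in zip(stamps, stamps[1:])]


# First three hangs copied from the dmesg excerpt in this comment.
SAMPLE = """\
[  974.865662] [drm] stuck on render ring
[  974.869560] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 1248.988099] [drm] stuck on render ring
[ 1248.992108] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 1354.039617] [drm] stuck on render ring
[ 1354.043649] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
"""

if __name__ == "__main__":
    print(hang_intervals(hang_timestamps(SAMPLE)))
```

Feeding it the full output of `dmesg` instead of SAMPLE gives one gap per pair of consecutive hangs since boot.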

Happy to test other ideas.
Comment 82 Mika Kuoppala 2014-11-06 13:43:53 UTC
*** Bug 85503 has been marked as a duplicate of this bug. ***
Comment 83 Hugh Greenberg 2014-11-06 15:02:13 UTC
Thanks for trying it.  The patch is wrong, sorry.  I'll keep working on it.
To reproduce this in Kodi, are there any settings that I need to enable?

(In reply to Peter Frühberger from comment #81)
Comment 84 Peter Frühberger 2014-11-06 15:08:34 UTC
Quite easy:

Build the latest master (Helix beta1) from https://github.com/xbmc/xbmc/commits/master or use a nightly PPA. Enable VAAPI (disable VDPAU). Under Video -> Acceleration, check that "Prefer VAAPI Render Method" is enabled, which is the default. You need to switch the settings level to "Expert" to see those settings.

You can make the error much more likely by watching interlaced content and using Motion Compensation Deinterlacing with VPP on top.

Beware: no released VAAPI driver supports this yet, so you need to use http://cgit.freedesktop.org/vaapi/intel-driver/log/ (e.g. the master branch). Gwenole repaired the driver (vebox fixes) some months ago and pushed the results into the libva master branch last week.

While playing such an interlaced video, press return, select the video film roll icon and set Deinterlace: Auto and Deinterlace-Method: Motion Compensation Deinterlacing. Save for all files.

Now wait up to 10 minutes and watch it hang.

Btw. using VPP deinterlacing seems to stress the GPU more, so you don't need to keep it running for hours (as you would with progressive content).

You need a Celeron HSW platform to reproduce. On my Core systems the issue almost never happens.
Comment 85 Ferry Toth 2014-11-06 18:54:24 UTC
@Peter Frühberger  Can you tell me the name of the package you refer to in comment #74?
Comment 86 Peter Frühberger 2014-11-06 18:55:38 UTC
There is no package. That was only a question. A short Google search revealed that the Chrome guys seem to use something highly customized, so I don't think such a package exists.
Comment 87 Hugh Greenberg 2014-11-07 19:33:33 UTC
I can't figure this out.  I'm not an Intel or kernel developer, but maybe I could figure it out with hardware debugging support or docs that were correct.  I would recommend that we just stop purchasing Intel GPUs and go with Nvidia based GPUs.  This bug report is one of many for the same bug.  My first report was from June.  I really don't think this is going to get fixed.
Comment 88 Peter Frühberger 2014-11-07 20:40:23 UTC
Yeah, you make exactly the right point. And sorry - I thought you were an Intel dev last time :-). I even searched for you in the Intel channel. Now I know why the other Intel devs could not find you.

Thanks much for trying to help.

It really feels like being a 3rd party citizen. I am not sure what else we could do to solve that issue.


I will also ignore that bugreport from now on .. I have a PR ready to remove VAAPI from xbmc. I think this will be a good signal for protesting.
Comment 89 Hugh Greenberg 2014-11-07 22:09:25 UTC
(In reply to Peter Frühberger from comment #88)
> Yeah. You exactly make the right point. And sorry - I thought you were an
> intel dev last time :-). I even searched for you in the intel channel. Now I
> know why the other intel devs could not find you.
> 
> Thanks much for trying to help.
> 
> It really feels like being a 3rd party citizen. I am not sure what else we
> could do to solve that issue.
> 
> 
> I will also ignore that bugreport from now on .. I have a PR ready to remove
> VAAPI from xbmc. I think this will be a good signal for protesting.

No problem.  I should have made that clear.

I have an Acer C720 and I've been making Linux distributions for it so that other Acer C720 owners who want Linux can easily install it without having to figure a ton of things out.  This was the last major bug that I have encountered.  I also started a site around those distros (distroshare.com) so there could be a single place for others to share such distributions.

I'm a big fan of Kodi/XBMC btw.  Thanks for such awesome software.
Comment 90 Hugh Greenberg 2014-11-08 01:34:58 UTC
In case you weren't aware, this bug will actually affect any hardware acceleration path that XBMC takes, not just the VAAPI one.  It just seems to show up more with VAAPI.  Anything that uses the dri/drm layers will encounter this bug.

Maybe you can direct your users that encounter this bug to submit a bug report. 

(In reply to Peter Frühberger from comment #88)
> Yeah. You exactly make the right point. And sorry - I thought you were an
> intel dev last time :-). I even searched for you in the intel channel. Now I
> know why the other intel devs could not find you.
> 
> Thanks much for trying to help.
> 
> It really feels like being a 3rd party citizen. I am not sure what else we
> could do to solve that issue.
> 
> 
> I will also ignore that bugreport from now on .. I have a PR ready to remove
> VAAPI from xbmc. I think this will be a good signal for protesting.
Comment 91 M. Kramer 2014-11-09 17:16:20 UTC
Created attachment 109165 [details]
GPU-hang dmesg output on Pentium G3420 using OpenELEC 4.2.1/XBMC

Hi, 

I'm having the crashes described here quite often while running XBMC on OpenELEC 4.2.1 (latest stable):

[15194.281458] [drm] stuck on render ring
[15194.282221] [drm] GPU HANG: ecode 0:0x87d3bffa, in xbmc.bin [780], reason: Ring hung, action: reset
[15196.281552] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

Complete dmesg attached. 

This causes the video to freeze every few seconds, while audio continues normally. After the hang the video 'catches up' by skipping a lot of frames until it's in sync again. I'm using x264 mpeg4 source material. 

This is a very annoying bug for me and makes my new media system seem like a waste of money.

Please investigate this issue further!

Thanks,
M. Kramer
Comment 92 adr3nal1n 2014-11-09 17:58:29 UTC
Hi,

I have a Haswell Pentium G3240 CPU, and adding the following kernel parameters to my bootloader seems to prevent the GPU hangs for me.

i915.semaphores=0 i915.use_mmio_flip=1 i915.enable_ppgtt=1 drm.vblankoffdelay=1 

I don't know how to add these kernel parameters in openelec so you will need to ask on their forum how to do this.
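Whatever the bootloader, it is worth checking that the parameters really reached the running kernel, which exposes its command line in /proc/cmdline. A minimal sketch, assuming simple whitespace-separated key=value tokens (no quoted values); the helper names are hypothetical:

```python
# Parameters from this comment; values are kept as strings, as on the command line.
WANTED = {
    "i915.semaphores": "0",
    "i915.use_mmio_flip": "1",
    "i915.enable_ppgtt": "1",
    "drm.vblankoffdelay": "1",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_or_different(wanted, cmdline):
    """Return the wanted parameters that are absent or set to another value."""
    actual = parse_cmdline(cmdline)
    return {
        name: actual.get(name)
        for name, value in wanted.items()
        if actual.get(name) != value
    }


def check_running_kernel(path="/proc/cmdline"):
    """Compare WANTED against the command line of the running kernel."""
    with open(path) as handle:
        return missing_or_different(WANTED, handle.read())
```

An empty dict from `check_running_kernel()` means all four parameters are on the command line with the values above. It does not tell you whether the driver honoured them.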

Hope this helps.
Comment 93 adr3nal1n 2014-11-09 18:37:00 UTC
Also note that this guy with a Haswell Chromebook uses only the following kernel parameters to prevent the GPU hangs: https://johnlewis.ie/tentative-fixwork-around-for-i915-gpu-hangs/ I think I'll try reducing it down to these options too.

drm.debug=0 drm.vblankoffdelay=1 i915.semaphores=0

I am running Arch Linux with kernel 3.17.2 and the following packages:

- libva-intel-driver 1.4.1
- libva 1.4.1
- xf86-video-intel 2.99.916
- xbmc 13.2

Hope this information helps.
Comment 94 Hugh Greenberg 2014-11-09 20:06:28 UTC
We all tried the fix from John Lewis.  It caused system hangs and we couldn't figure out which options actually helped.

I'm trying your modified version of the command line and so far I've been able to use Kodi with VAAPI for a long time (2 hours I think) without a hang or freeze.  I will keep it going for the rest of the day, though, before I am sure that this really works.

(In reply to adr3nal1n from comment #92)
> Hi,
> 
> I have a haswell pentium G3240 cpu and adding the following kernel
> parameters to my bootloader seem to prevent the gpu hangs for me.
> 
> i915.semaphores=0 i915.use_mmio_flip=1 i915.enable_ppgtt=1
> drm.vblankoffdelay=1 
> 
> I don't know how to add these kernel parameters in openelec so you will need
> to ask on their forum how to do this.
> 
> Hope this helps.
Comment 95 Hugh Greenberg 2014-11-09 20:16:23 UTC
This bug report is where we were testing that fix: https://bugs.freedesktop.org/show_bug.cgi?id=80229. Comment 58 there (https://bugs.freedesktop.org/show_bug.cgi?id=80229#c58) uses the same command line as yours, except for vblankoffdelay, and still encountered system freezes.

(In reply to adr3nal1n from comment #93)
> Also note that this guy with a haswell chromebook uses only the following
> kernel parameters to prevent the gpu hangs 
> https://johnlewis.ie/tentative-fixwork-around-for-i915-gpu-hangs/ I think
> i'll try reducing it down to these options too.
> 
> drm.debug=0 drm.vblankoffdelay=1 i915.semaphores=0
> 
> I am running arch linux with kernel 3.17.2 and the following elements:
> 
> - libva-intel-driver 1.4.1
> - libva 1.4.1
> - xf86-video-intel 2.99.916
> - xbmc 13.2
> 
> Hope this information helps.
Comment 96 adr3nal1n 2014-11-09 20:30:19 UTC
Hope your testing goes well Hugh,

I have been using (i915.semaphores=0 i915.use_mmio_flip=1 i915.enable_ppgtt=1 drm.vblankoffdelay=1) with xbmc gotham for a couple of days now with no hangs.

I'll post again if I notice any hangs over the coming days. (I normally use xbmc for a few hours a day.)
Comment 97 Hugh Greenberg 2014-11-09 20:59:03 UTC
If someone could figure out how to dump the GPU instructions on a Windows machine with a Haswell chipset, I think we could develop a patch. Why the Intel developers couldn't do this is beyond me.
Comment 98 adr3nal1n 2014-11-10 11:10:54 UTC
(In reply to adr3nal1n from comment #96)
> Hope your testing goes well Hugh,
> 
> I have been using (i915.semaphores=0 i915.use_mmio_flip=1
> i915.enable_ppgtt=1 drm.vblankoffdelay=1) with xbmc gotham for a couple of
> days now with no hangs.
> 
> I'll post again if i notice any hangs over the coming days. (I normally use
> xbmc for a few hours a day)

Hi Hugh,

Just wanted to let you know that XBMC dev fritsch stated the following regarding the use of the above kernel parameters. "We made longtime tests and the same happens after 12 hours or more. So it seems to make the bug "more unlikely", but if you add additional load on the GPU (as we do with v14 and the VPP Deinterlacers), you will get the hang again."
Comment 99 Peter Frühberger 2014-11-10 11:12:01 UTC
I am this guy (https://bugs.freedesktop.org/show_bug.cgi?id=83677#c84) aka fritsch.
Comment 100 Hugh Greenberg 2014-11-10 16:31:12 UTC
You should return it if you can.  This bug has been around for more than 6 months.  It doesn't seem like it will be fixed any time soon.

(In reply to M. Kramer from comment #91)
Comment 101 Menno 2014-11-10 21:07:30 UTC
I also suffer from this bug.
An OpenELEC log of it can be seen here:
http://sprunge.us/QBKV

and the GPU crash log can be downloaded here:

http://www.demenno.nl/error.txt

That's almost 4MB.

Intel please fix this asap!
Comment 102 Barry Scott 2014-11-12 10:50:33 UTC
Created attachment 109331 [details]
1037U kernel BUG traceback in i915 code

The picture shows a kernel BUG that we can reproduce on IvyBridge CPUs.
This traceback is from a 1037U running the kernel patched by Simon Farnsworth with the advice of Chris Wilson.

We can reproduce it, but not at will.
We have test code that will provoke the bug given enough test runs.
Comment 103 Menno 2014-11-12 14:28:28 UTC
Use the hardware deinterlacers and you'll reproduce it within 11 minutes at most, every time (see my logs).
Comment 104 Hugh Greenberg 2014-11-13 00:46:35 UTC
(In reply to Barry Scott from comment #102)
> Created attachment 109331 [details]
> 1037U kernel BUG traceback in i915 code
> 
> The picture shows a kernel BUG that we can reproduce on IvyBridge CPUs.
> This traceback is from a 1037U running the kernel patched by Simon
> Farnsworth with the advice of Chris Wilson.
> 
> We can reproduce but not at will.
> We have test code that will provoke the bug given enough test runs.

Would you mind sharing how you did that?
Comment 105 Chris Wilson 2014-11-14 12:14:38 UTC
Created attachment 109459 [details] [review]
Force a CS stall inside gen7 invalidate-caches
Comment 106 Peter Frühberger 2014-11-14 14:32:44 UTC
Here are 3.17.2 mainline kernel builds with the patch applied:

https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.17.2wilsonv1_3.17.2wilsonv1-10.00.Custom_amd64.deb
https://dl.dropboxusercontent.com/u/55728161/linux-image-3.17.2wilsonv1_3.17.2wilsonv1-10.00.Custom_amd64.deb

I will be traveling until Sunday, so give those a nice test, please.
Comment 107 Peter Frühberger 2014-11-14 16:13:19 UTC
I made a short test on my 1820T: http://paste.ubuntu.com/9008515/ - did not help, got the gpu hang after exactly 20 seconds.
Comment 108 Peter Frühberger 2014-11-18 05:36:36 UTC
Here are new test kernels that Chris Wilson wants you to test:

https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.18.0-rc5-icklemasterv1%2B_3.18.0-rc5-icklemasterv1%2B-10.00.Custom_amd64.deb
https://dl.dropboxusercontent.com/u/55728161/linux-image-3.18.0-rc5-icklemasterv1%2B_3.18.0-rc5-icklemasterv1%2B-10.00.Custom_amd64.deb

You can find a fork of this branch on github (to download with a faster connection): https://github.com/fritsch/linux/commits/ickle-master

It would be nice if you could try them and give feedback.
Comment 109 Hugh Greenberg 2014-11-18 06:04:05 UTC
I tried it, and while LightDM loaded, I couldn't log in since X crashed on login.  Here is a stack trace:

(gdb) where
#0  0x00007f5857825d27 in __GI_raise (sig=sig@entry=6)
    at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1  0x00007f5857827418 in __GI_abort () at abort.c:89
#2  0x00007f5859c77f6e in OsAbort () at ../../os/utils.c:1361
#3  0x00007f5859c7d7c3 in AbortServer () at ../../os/log.c:786
#4  0x00007f5859c7e60d in FatalError (
    f=f@entry=0x7f5859c922fc "%s: VT_ACTIVATE failed: %s\n")
    at ../../os/log.c:924
#5  0x00007f5859b78151 in switch_to (vt=7, 
    from=0x7f5859c92375 "xf86OpenConsole")
    at ../../../../../hw/xfree86/os-support/linux/lnx_init.c:72
#6  0x00007f5859b783e9 in xf86OpenConsole ()
    at ../../../../../hw/xfree86/os-support/linux/lnx_init.c:209
#7  0x00007f5859b54e9d in InitOutput (
    pScreenInfo=pScreenInfo@entry=0x7f5859f13b00 <screenInfo>, 
    argc=argc@entry=11, argv=argv@entry=0x7fff0f438a78)
    at ../../../../hw/xfree86/common/xf86Init.c:597
#8  0x00007f5859b160ba in dix_main (argc=11, argv=0x7fff0f438a78, 
    envp=<optimized out>) at ../../dix/main.c:202
#9  0x00007f5857810ec5 in __libc_start_main (main=0x7f5859b00680 <main>, 
    argc=11, argv=0x7fff0f438a78, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fff0f438a68) at libc-start.c:287
#10 0x00007f5859b006ae in _start ()


(In reply to Peter Frühberger from comment #108)
> Here are new test kernels, chris wilson wants you to test:
> 
> https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.18.0-rc5-
> icklemasterv1%2B_3.18.0-rc5-icklemasterv1%2B-10.00.Custom_amd64.deb
> https://dl.dropboxusercontent.com/u/55728161/linux-image-3.18.0-rc5-
> icklemasterv1%2B_3.18.0-rc5-icklemasterv1%2B-10.00.Custom_amd64.deb
> 
> You can find a fork of this branch on github (to download with a faster
> connection): https://github.com/fritsch/linux/commits/ickle-master
> 
> Would be nice, if you could try it.
> 
> Feedback would be nice.
Comment 110 Chris Wilson 2014-11-18 07:37:33 UTC
(In reply to Hugh Greenberg from comment #109)
> I tried it, and while LightDM loaded, I couldn't log in since X crashed on
> login.  Here is a stack trace:

That's a bug in the VT layer; a race in the graphics mode takeover of the console. You have to restart X - I forget that lightdm doesn't handle that automatically.
Comment 111 Hugh Greenberg 2014-11-18 18:25:18 UTC
LightDM and the unity desktop worked after I disabled dri3 in the intel driver as Chris suggested. After that change and enabling tear free (also as Chris suggested), I was able to test this kernel. I was able to play a video in Kodi for over an hour using the VAAPI support as Peter described above and I did not experience a hang.


(In reply to Chris Wilson from comment #110)
> (In reply to Hugh Greenberg from comment #109)
> > I tried it, and while LightDM loaded, I couldn't log in since X crashed on
> > login.  Here is a stack trace:
> 
> That's a bug in the VT layer; a race in the graphics mode takeover of the
> console. You have to restart X - I forget that lightdm doesn't handle that
> automatically.
Comment 112 Peter Frühberger 2014-11-18 18:47:48 UTC
TearFree is a nightmare for applications that count swapBuffers ...

But nevertheless that sounds promising. Now we need to find out which change is the real fix.

Can you post the xorg.conf snippet you use to make it work?
Comment 113 Hugh Greenberg 2014-11-18 18:55:10 UTC
This is the config file that I put in /usr/share/X11/xorg.conf.d:

Section "Device"
  Identifier "Intel Graphics"
  Driver "intel"
  Option "TearFree" "true"
EndSection

Here is my x11 intel driver recompiled with the  --disable-dri3  option: https://drive.google.com/file/d/0B6zPD2kAJoTJcHdNS1J1VWpKY2s/view?usp=sharing .  I'm using the oibaf ppa for the latest graphics stack - https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers.
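For anyone who wants to generate this snippet on several test machines, here is a small sketch that renders the same Device section from a dict of options. The function name is made up for illustration. Only the section layout and the TearFree option come from the config above.

```python
def render_device_section(identifier, driver, options):
    """Render an xorg.conf "Device" section as text.

    `options` maps option names to values, e.g. {"TearFree": "true"}.
    """
    lines = [
        'Section "Device"',
        f'  Identifier "{identifier}"',
        f'  Driver "{driver}"',
    ]
    for name, value in options.items():
        lines.append(f'  Option "{name}" "{value}"')
    lines.append("EndSection")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Reproduces the snippet from this comment on stdout.
    print(render_device_section("Intel Graphics", "intel", {"TearFree": "true"}), end="")
```

The output can be redirected into a file in the xorg.conf.d directory mentioned above. Whether other options are safe to combine with TearFree is something I have not tested.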

(In reply to Peter Frühberger from comment #112)
> Tearfree is a nightmare for applications that count swapBuffers ...
> 
> But never the less that sounds promissing, now we need to find out which
> change is the real fix.
> 
> Can you post the xorg.conf sniplet you use to make it work?
Comment 114 Peter Frühberger 2014-11-18 18:58:24 UTC
Thanks much. I will keep TearFree off; "man intel" shows why it is not wanted in the xbmc context.
Comment 115 Hugh Greenberg 2014-11-18 18:59:38 UTC
If I didn't turn it off, I got hangs just by launching xbmc. You'll probably see the same thing.

(In reply to Peter Frühberger from comment #114)
> Thanks much. I will keep TearFree of, from "man intel" one sees why it is
> not wanted in xbmc context.
Comment 116 Hugh Greenberg 2014-11-18 19:00:01 UTC
I mean that if I didn't turn tear free on, I got the hangs. 

(In reply to Hugh Greenberg from comment #115)
> If I didn't turn it off, I got hangs just by launching xbmc. You'll probably
> see the same thing.
> 
> (In reply to Peter Frühberger from comment #114)
> > Thanks much. I will keep TearFree of, from "man intel" one sees why it is
> > not wanted in xbmc context.
Comment 117 Peter Frühberger 2014-11-18 19:05:48 UTC
Ah nice!

I have TearFree turned off, running with the "normal" intel drivers now. So I should also hit the bug now?
Comment 118 Hugh Greenberg 2014-11-18 19:10:18 UTC
You got much farther than me without modifying anything. I couldn't even launch xbmc without dri3 disabled and I got the hangs with TearFree turned off.

(In reply to Peter Frühberger from comment #117)
> Ah nice!
> 
> I have TearFree turned off, running with the "normal" intel drivers now. So
> I should also hit the bug now?
Comment 119 Peter Frühberger 2014-11-18 19:11:48 UTC
I am running the following intel packages: 
ii  xserver-xorg-video-intel                  2:2.99.910-0ubuntu1.1                      amd64        X.Org X server -- Intel i8xx, i9xx display driver

Try those standard packages. We have massive issues with everything newer than 910, apparently related to swap buffers, so we use 910 on all machines.

Just purge the oibaf ppa.
Comment 120 Hugh Greenberg 2014-11-18 19:14:42 UTC
The reason why I didn't do that is because the vaapi driver that you were testing required a newer libva.  I didn't know if that had dependencies on the newer mesa or not. I guess not.

(In reply to Peter Frühberger from comment #119)
> I am running the following intel packages: 
> ii  xserver-xorg-video-intel                  2:2.99.910-0ubuntu1.1         
> amd64        X.Org X server -- Intel i8xx, i9xx display driver
> 
> try those - standard packages. We have massive issues with everything > 910,
> as it seems it has issues with swap buffers, therefore we use 910 on all the
> machines.
> 
> Just purge the oibaf ppa.
Comment 121 Peter Frühberger 2014-11-18 19:18:34 UTC
We have a ppa for vaapi: https://launchpad.net/~wsnipex/+archive/ubuntu/vaapi

nothing else is needed. I run 10.3 mesa from utopic though.
Comment 122 Hugh Greenberg 2014-11-18 19:22:05 UTC
Thanks! I'm on utopic with mesa 10.3 now and your vaapi ppa. Things are good so far. I'll keep xbmc going for a while again.

(In reply to Peter Frühberger from comment #121)
> We have a ppa for vaapi: https://launchpad.net/~wsnipex/+archive/ubuntu/vaapi
> 
> nothing else is needed. I run 10.3 mesa from utopic though.
Comment 123 Hugh Greenberg 2014-11-18 19:55:04 UTC
I'm not sure when to celebrate, but I still haven't experienced any hangs.

(In reply to Hugh Greenberg from comment #122)
> Thanks! I'm on utopic with mesa 10.3 now and your vaapi ppa. Things are good
> so far. I'll keep xbmc going for a while again.
> 
> (In reply to Peter Frühberger from comment #121)
> > We have a ppa for vaapi: https://launchpad.net/~wsnipex/+archive/ubuntu/vaapi
> > 
> > nothing else is needed. I run 10.3 mesa from utopic though.
Comment 124 Hugh Greenberg 2014-11-18 20:43:52 UTC
I could be wrong again, but I'm guessing that this is the patch: https://github.com/fritsch/linux/commit/dba076df4b79d2472ef5d6e19b72ca3856eafb1a . I'll try just that patch and report back here later.

(In reply to Hugh Greenberg from comment #123)
> I'm not sure when to celebrate, but I still haven't experienced any hangs.
> 
> (In reply to Hugh Greenberg from comment #122)
> > Thanks! I'm on utopic with mesa 10.3 now and your vaapi ppa. Things are good
> > so far. I'll keep xbmc going for a while again.
> > 
> > (In reply to Peter Frühberger from comment #121)
> > > We have a ppa for vaapi: https://launchpad.net/~wsnipex/+archive/ubuntu/vaapi
> > > 
> > > nothing else is needed. I run 10.3 mesa from utopic though.
Comment 125 Peter Frühberger 2014-11-18 20:51:02 UTC
You know what :-)

I thought exactly the same. But I don't understand the code well enough, so I did not try. Can you "fix" 3.17.3 with that patch picked on top?
Comment 126 Hugh Greenberg 2014-11-18 20:58:46 UTC
Yes, I will do that and post the links here.

(In reply to Peter Frühberger from comment #125)
> You know what :-)
> 
> I exactly thought the same. But I don't understand the code too much, so did
> not try. Can you "fix" 3.17.3 with that patch picked on top?
Comment 127 Peter Frühberger 2014-11-18 21:00:40 UTC
The patch does not apply cleanly; the batch_buffer does not seem to be there. Let's wait and see what Chris Wilson tells us.
Comment 128 Peter Frühberger 2014-11-18 21:08:33 UTC
I picked the patch and fixed up what I think could be right, here: https://github.com/fritsch/linux/tree/gpuhang
Comment 129 Peter Frühberger 2014-11-18 21:53:17 UTC
Save your time. Chris Wilson said on IRC that this patch will 100% not fix our bug.
Comment 130 Hugh Greenberg 2014-11-24 14:39:28 UTC
I backported Chris Wilson's branch to 3.17.4.

Patch: 
https://drive.google.com/file/d/0B6zPD2kAJoTJNEZnczJ3YU1ickU/view?usp=sharing

Kernel debs: 
https://drive.google.com/file/d/0B6zPD2kAJoTJejNLdEFCS01lblk/view?usp=sharing
https://drive.google.com/file/d/0B6zPD2kAJoTJMXJlY3NYSVZfd2M/view?usp=sharing

I've tested these for 20+ hours and they have been working well. The only caveat is that TearFree needs to be enabled in the intel driver until there is a patch available for that. You can enable it like this: https://wiki.archlinux.org/index.php/Intel_graphics#Tear-free_video .
Comment 131 Chris Wilson 2014-11-25 09:18:13 UTC
*** Bug 86670 has been marked as a duplicate of this bug. ***
Comment 132 dhead666 2014-11-25 15:08:33 UTC
The huge patch that Hugh Greenberg ported to 3.17.4 is working very well for me.
After a day of work with the system I didn't experience any freezes or hangs, with Chromium or otherwise.
I only kept the kernel parameter i915.modeset=1.

I did experience a rare slowdown of Chromium, with a segfault logged in journald; I hadn't seen such a segfault before.

kernel: WebCore: Worker[15893]: segfault at fbadbeef ip 00007f75abde2e25 sp 00007f758d175190 error 6 in chromium[7f75a993d000+5b6f000

For comparison, with the upstream kernel:
* with no kernel parameters (except i915.modeset=1), Chromium would hang and force me to kill it.
* with the parameters i915.modeset=1 i915.semaphores=0 i915.use_mmio_flip=1 i915.enable_ppgtt=1 drm.vblankoffdelay=1, instead of hanging, Chromium would slow the system down to almost a halt. Kodi would also trigger such slowdowns (at a much higher rate than with the patch). It seems that i915.semaphores=0 is what makes the difference between a hang and a slowdown.

I didn't pay much attention to testing VAAPI, but it does seem to work fine.
Comment 133 Daniel Vetter 2014-11-26 16:25:55 UTC
Please test this patch

http://patchwork.freedesktop.org/patch/37647/
Comment 134 Chris Wilson 2014-11-26 16:43:21 UTC
(In reply to Daniel Vetter from comment #133)
> Please test this patch
> 
> http://patchwork.freedesktop.org/patch/37647/

I've already tested that theory with Simon's testcase. It's another dead end.
Comment 135 Peter Frühberger 2014-11-26 18:04:38 UTC
Setting this to Assigned again, as the main dev has already tested the bits requested by danvet and found that they do not work.
Comment 136 dhead666 2014-12-01 20:56:19 UTC
I finally encountered a hang with Chris Wilson's branch and kernel 3.17.4.

kernel: [drm] GPU HANG: ecode 7:0:0x87d3bffa, in chromium [15797], reason: Stuck on render ring, action: reset

My system froze after I reopened Chromium so I don't have the gpu crash dump.
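Since the sysfs file does not survive a reboot, it can help to copy the error state to disk as soon as a hang shows up in the log. A minimal sketch, assuming the path from the dmesg message quoted earlier in this bug (/sys/class/drm/card0/error); the output file name is an arbitrary choice, and reading the file may require root:

```python
import gzip
import shutil
import time
from pathlib import Path

# Path taken from "GPU crash dump saved to /sys/class/drm/card0/error".
DEFAULT_ERROR_STATE = Path("/sys/class/drm/card0/error")


def save_error_state(dest_dir, source=DEFAULT_ERROR_STATE):
    """Copy the GPU error state into a gzip file under dest_dir and return its path."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # File name pattern is an assumption made for this sketch.
    target = dest_dir / f"i915-error-{int(time.time())}.gz"
    # Stream the content instead of relying on the reported file size,
    # because sysfs files do not report a meaningful size.
    with open(source, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Calling `save_error_state("/var/tmp/gpu-dumps")` right after a "GPU HANG" line appears would leave a compressed copy that can be attached here, provided the machine stays up long enough to write it.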
Comment 137 Hugh Greenberg 2014-12-02 17:17:46 UTC
I think that it is possible that this is a different problem due to chromium and hardware acceleration. I have been running on two devices for 7 days straight without a single hang.

(In reply to dhead666 from comment #136)
> Finally I encounter a hang with Chris Wilson's branch and kernel 3.17.4.
> 
> kernel: [drm] GPU HANG: ecode 7:0:0x87d3bffa, in chromium [15797], reason:
> Stuck on render ring, action: reset
> 
> My system froze after I reopened Chromium so I don't have the gpu crash dump.
Comment 138 Chris Wilson 2014-12-05 08:07:10 UTC
*** Bug 78983 has been marked as a duplicate of this bug. ***
Comment 139 Chris Wilson 2014-12-06 16:38:29 UTC
*** Bug 87045 has been marked as a duplicate of this bug. ***
Comment 140 Chris Wilson 2014-12-10 07:50:41 UTC
*** Bug 87176 has been marked as a duplicate of this bug. ***
Comment 141 Chris Wilson 2014-12-10 20:56:27 UTC
Created attachment 110698 [details] [review]
Add extra flush flags for gen7 invalidate

A pair of patches that seem to do the trick...
Comment 142 Chris Wilson 2014-12-10 20:57:04 UTC
Created attachment 110699 [details] [review]
Keep GPU awake for context switches
Comment 143 Peter Frühberger 2014-12-11 06:46:05 UTC
I built Ubuntu kernels with the two patches applied. After discussion with Chris I left out the ringbuffer changes, as those were not needed.

You can find the patches in my 3.18.0 tree on github.com/fritsch - the latest two of them.

Ubuntu kernel debs are here:
https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.18.0-ickle75%2B_3.18.0-ickle75%2B-10.00.Custom_amd64.deb
https://dl.dropboxusercontent.com/u/55728161/linux-image-3.18.0-ickle75%2B_3.18.0-ickle75%2B-10.00.Custom_amd64.deb

Happy testing.
Comment 144 Peter Frühberger 2014-12-12 07:08:36 UTC
Looking very good for now. We have ported the fix to kernel 3.17 and included it into OpenELEC.

So far we already have one promising report. I will keep you informed. Thank you very much.
Comment 145 Nikola Šnele 2014-12-12 11:20:53 UTC
I can confirm that the bug is fixed in Peter's kernel. No GPU hangs anymore :)
Comment 146 dhead666 2014-12-13 14:53:47 UTC
So far no hangs on my C720 Chromebook with the two patches from Peter's repo.

Thanks Chris for figuring it out, this bug was very annoying and disruptive.
Comment 147 Hugh Greenberg 2014-12-13 22:48:56 UTC
The fix is working great for me. Thank you very much Chris.
Comment 148 adr3nal1n 2014-12-14 11:47:57 UTC
Working really well here too! :-)  I used only the 2nd 3.17 kernel patch from Peter Frühberger (the one without the ringbuffer changes).

For reference, I am running Arch Linux 3.17 x86_64 with a Haswell Pentium G3240 CPU. I am testing with XBMC 13.2 Gotham and so far have not had any GPU hangs, frame drops or skips during HD video playback. :-)

Thanks very much Chris Wilson for all your hard work on fixing this and to Peter Frühberger.

When do you think this patch is likely to be added to the latest kernel at kernel.org? Sorry for asking, but I am unfamiliar with how the Linux kernel patch submission process works.
Comment 149 adr3nal1n 2014-12-14 12:00:57 UTC
In addition to the above for reference, I am running the following arch linux x86_64 packages:

libva 1.4.1-1
libva-intel-driver 1.4.1-1
xf86-video-intel 2.99.916-3
mesa 10.3.5-1
mesa-dri 10.3.5-1
libxvmc 1.0.8-1

Hope this information helps.
Comment 150 Peter Frühberger 2014-12-15 10:37:32 UTC
Tested and working very well. We are using a ported version in OpenELEC in the VAAPI testing thread.
Comment 151 dhead666 2014-12-16 16:41:32 UTC
I'm not sure if this is another issue or related to this one, but even with the two patches from Peter's repo I'm still experiencing slowdowns and excessive use of RAM with Chromium.
Running Chromium with a few tabs open alongside another application that uses the GPU (like Kodi) makes the slowdown appear sooner.
One might blame the slowdowns on Chromium's excessive use of RAM, but I've got 4GB of it.

I've been experiencing this for a while, but until now the "stuck on render ring" hang forced me to use i915 kernel parameters or the huge backport from Chris Wilson's development branch, so I couldn't be sure this issue would still exist after the "stuck on render ring" hang was resolved.

This is usually the output in journald:

systemd-coredump[19029]: Process 14662 (chromium) of user 1000 dumped core.
chromium.desktop[14628]: [19045:19046:1216/181255:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.
kernel: Watchdog[19046]: segfault at 0 ip 00007f7cfe3e619b sp 00007f7ce78f75a0 error 6 in chromium[7f7cfa128000+6499000]
chromium.desktop[14628]: [14628:14628:1216/181255:ERROR:gpu_process_transport_factory.cc(437)] Failed to establish GPU channel.
systemd-coredump[19047]: Process 19045 (chromium) of user 1000 dumped core.
chromium.desktop[14628]: [14628:14628:1216/181255:ERROR:gpu_process_transport_factory.cc(461)] Lost UI shared context.
gnome-session[8166]: Window manager warning: last_focus_time (252848839) is greater than comparison timestamp (252820277).  This most likely represents a buggy client sending inaccurate timestamps in messages such as _NET_ACTIVE_WINDOW.  Trying to work around...
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 1 PID: 18930 at drivers/gpu/drm/i915/intel_pm.c:6585 intel_display_power_put+0x15c/0x170 [i915]()
kernel: Modules linked in: fuse ctr ccm ecb ath3k btusb bluetooth hid_logitech_dj usbhid hid nvram tpm_infineon snd_hda_codec_hdmi arc4 joydev ath9k mousedev cyapa ath9k_common ath9k_hw coretemp hwmon iTCO_wdt iTCO_vendor_support intel_rapl ath x86_pkg_temp_thermal intel_powerclamp mac80211 kvm_intel kvm crct10dif_pclmul crc32_pclmul evdev crc32c_intel mac_hid snd_hda_codec_realtek chromeos_laptop snd_hda_codec_generic cfg80211 ghash_clmulni_intel cryptd pcspkr serio_raw i915 rfkill i2c_i801 snd_hda_intel shpchp lpc_ich snd_hda_controller fan ac tpm_tis battery snd_hda_codec tpm snd_hwdep drm_kms_helper i2c_designware_pci snd_pcm thermal dw_dmac_pci drm video snd_timer dw_dmac dw_dmac_core gpio_lynxpoint 8250_dw snd soundcore intel_gtt i2c_designware_platform i2c_algo_bit processor i2c_designware_core
kernel:  spi_pxa2xx_platform button uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev media i2c_core sch_fq_codel ext4 crc16 mbcache jbd2 sd_mod atkbd libps2 i8042 serio sdhci_acpi sdhci led_class mmc_core ahci libahci libata scsi_mod xhci_pci xhci_hcd usbcore usb_common
kernel: CPU: 1 PID: 18930 Comm: kworker/1:0 Tainted: G        W      3.18.0-1-mainline #3
kernel: Hardware name: Acer Peppy, BIOS          10/18/2013
kernel: Workqueue: events edp_panel_vdd_work [i915]
kernel:  0000000000000000 0000000012e4cb13 ffff88003790fd28 ffffffff8154ecb4
kernel:  0000000000000000 0000000000000000 ffff88003790fd68 ffffffff81072bc1
kernel:  ffff88003790fd48 ffff88007b22002c 000000000000000b ffff88007b228810
kernel: Call Trace:
kernel:  [<ffffffff8154ecb4>] dump_stack+0x4e/0x71
kernel:  [<ffffffff81072bc1>] warn_slowpath_common+0x81/0xa0
kernel:  [<ffffffff81072cda>] warn_slowpath_null+0x1a/0x20
kernel:  [<ffffffffa03d3a3c>] intel_display_power_put+0x15c/0x170 [i915]
kernel:  [<ffffffffa044446d>] pps_unlock+0x3d/0x50 [i915]
kernel:  [<ffffffffa04480c9>] edp_panel_vdd_work+0x39/0x40 [i915]
kernel:  [<ffffffff8108b7c5>] process_one_work+0x145/0x400
kernel:  [<ffffffff8108bd8b>] worker_thread+0x6b/0x4a0
kernel:  [<ffffffff8108bd20>] ? init_pwq.part.22+0x10/0x10
kernel:  [<ffffffff81090dfa>] kthread+0xea/0x100
kernel:  [<ffffffff81090d10>] ? kthread_create_on_node+0x1c0/0x1c0
kernel:  [<ffffffff8155477c>] ret_from_fork+0x7c/0xb0
kernel:  [<ffffffff81090d10>] ? kthread_create_on_node+0x1c0/0x1c0
kernel: ---[ end trace a3c190b67c9fbfe4 ]---


Sometimes I also get this one:

kernel: [drm:ivybridge_set_fifo_underrun_reporting] *ERROR* uncleared fifo underrun on pipe A
Comment 152 Hugh Greenberg 2014-12-16 16:51:19 UTC
I don't know if this is the same issue or not, but I have noticed slowdowns on the Acer C720 related to swap and the disk cache. My solution has been to set the following in /etc/sysctl.conf:

vm.swappiness = 0
vm.dirty_background_bytes = 0
vm.dirty_bytes = 0
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
vm.dirty_writeback_centisecs = 500

I know you are on Arch, so you may not have to do this, but I needed to change /usr/lib/pm-utils/power.d/laptop-mode, replacing the vmfiles variable with vmfiles="laptop_mode"; otherwise the change would not be permanent.

So far this has worked for me to reduce slow downs.
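
A minimal sketch, not part of the original comment, for checking which of the sysctl settings listed above the running kernel actually reports. The function name check_sysctl and its optional second argument (an alternative root directory, used only for testing) are made up for illustration. It only reads files under /proc/sys and changes nothing.

```shell
#!/bin/sh
# Sketch: print each "key = value" setting from a sysctl-style file
# next to the value the running kernel currently reports.
# Usage: check_sysctl FILE [PROC_ROOT]   (PROC_ROOT defaults to /proc/sys)
check_sysctl() {
    conf=$1
    root=${2:-/proc/sys}
    # Drop comments and blank lines, then split each line on '='.
    sed -e 's/#.*//' -e '/^[[:space:]]*$/d' "$conf" |
    while IFS='=' read -r key want; do
        key=$(echo "$key" | tr -d '[:space:]')
        want=$(echo "$want" | tr -d '[:space:]')
        # vm.swappiness is exposed as /proc/sys/vm/swappiness
        path="$root/$(echo "$key" | tr . /)"
        if [ -r "$path" ]; then
            have=$(tr -d '[:space:]' < "$path")
        else
            have="unreadable"
        fi
        if [ "$have" = "$want" ]; then
            echo "ok       $key = $have"
        else
            echo "differs  $key: wanted $want, kernel has $have"
        fi
    done
}
```

For example, run `check_sysctl /etc/sysctl.conf` after loading the file with `sysctl -p` as root. According to the kernel's vm sysctl documentation, the *_bytes and *_ratio forms of each dirty limit are alternatives, so the form that is not in use reads back as 0.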

(In reply to dhead666 from comment #151)
> I'm not sure if this is another issue or related to this one but even with
> the two patches from Peter's repo I'm still experiencing slowdowns and
> excessive use of RAM with Chromium.
Comment 153 Jani Nikula 2014-12-17 15:03:52 UTC
commit add284a3a2481e759d6bec35f6444c32c8ddc383
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Dec 16 08:44:32 2014 +0000

    drm/i915: Force the CS stall for invalidate flushes

and

commit 2c550183476dfa25641309ae9a28d30feed14379
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Dec 16 10:02:27 2014 +0000

    drm/i915: Disable PSMI sleep messages on all rings around context switches

in drm-intel-next-fixes.
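
A rough sketch, not from the original comment, of how one might check whether a local kernel git checkout already contains the two commits named above. The function names has_commit and report_fix_commits are hypothetical, and the checkout path is whatever the reader passes in. It relies on `git merge-base --is-ancestor`, which asks whether a commit is reachable from HEAD.

```shell
#!/bin/sh
# Succeeds if COMMIT ($2) is an ancestor of HEAD in the git checkout at REPO ($1).
# Unknown commits make git fail, which is treated the same as "not present".
has_commit() {
    git -C "$1" merge-base --is-ancestor "$2" HEAD 2>/dev/null
}

# Report on the two commit IDs quoted in comment 153.
report_fix_commits() {
    for c in add284a3a2481e759d6bec35f6444c32c8ddc383 \
             2c550183476dfa25641309ae9a28d30feed14379; do
        if has_commit "$1" "$c"; then
            echo "present: $c"
        else
            echo "missing: $c"
        fi
    done
}
```

Usage would be `report_fix_commits /path/to/linux`. A "missing" result on a stable or distribution branch does not prove the fix is absent, because backported (cherry-picked) commits get new commit IDs; searching `git log` for the commit subjects is the fallback there.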
Comment 154 dhead666 2014-12-17 19:01:52 UTC
@Hugh Greenberg, thanks but I'm not using swap.

I think I'll try to gather more info, properly compare against other GPUs and might try some of the tests in intel-gpu-tools before opening a separate issue on the matter.

Anyway, this should be discussed somewhere else so if you've anything else to share you're welcome to do this by email, G+ or even ping me at irc (dhead666@freenode).
Comment 155 Chris Wilson 2014-12-22 11:13:18 UTC
*** Bug 87571 has been marked as a duplicate of this bug. ***
Comment 156 Chris Wilson 2015-01-05 09:38:45 UTC
*** Bug 88017 has been marked as a duplicate of this bug. ***
Comment 157 Chris Wilson 2015-01-05 09:53:53 UTC
*** Bug 88044 has been marked as a duplicate of this bug. ***
Comment 158 Chris Wilson 2015-01-18 19:10:26 UTC
*** Bug 88341 has been marked as a duplicate of this bug. ***
Comment 159 Chris Wilson 2015-01-20 09:20:14 UTC
*** Bug 88612 has been marked as a duplicate of this bug. ***
Comment 160 Chris Wilson 2015-01-20 09:40:35 UTC
*** Bug 88604 has been marked as a duplicate of this bug. ***
Comment 161 Chris Wilson 2015-01-28 09:04:20 UTC
*** Bug 88839 has been marked as a duplicate of this bug. ***
Comment 162 Chris Wilson 2015-02-06 16:13:54 UTC
*** Bug 89010 has been marked as a duplicate of this bug. ***
Comment 163 Chris Wilson 2015-02-07 21:15:28 UTC
*** Bug 89025 has been marked as a duplicate of this bug. ***
Comment 164 Chris Wilson 2015-02-10 20:55:43 UTC
*** Bug 89065 has been marked as a duplicate of this bug. ***
Comment 165 Chris Wilson 2015-02-11 20:48:31 UTC
*** Bug 89089 has been marked as a duplicate of this bug. ***
Comment 166 Chris Wilson 2015-02-18 12:04:57 UTC
*** Bug 89183 has been marked as a duplicate of this bug. ***
Comment 167 Chris Wilson 2015-03-11 15:28:26 UTC
*** Bug 89531 has been marked as a duplicate of this bug. ***
Comment 168 Chris Wilson 2015-03-28 12:17:38 UTC
*** Bug 89799 has been marked as a duplicate of this bug. ***
Comment 169 Chris Wilson 2015-04-09 08:32:04 UTC
*** Bug 89964 has been marked as a duplicate of this bug. ***
Comment 170 Chris Wilson 2015-04-24 19:41:48 UTC
*** Bug 90165 has been marked as a duplicate of this bug. ***
Comment 171 Chris Wilson 2015-05-18 20:17:41 UTC
*** Bug 90509 has been marked as a duplicate of this bug. ***
Comment 172 Chris Wilson 2015-05-26 08:02:00 UTC
*** Bug 90635 has been marked as a duplicate of this bug. ***
Comment 173 Chris Wilson 2015-05-26 14:24:06 UTC
*** Bug 90659 has been marked as a duplicate of this bug. ***
Comment 174 Winni 2015-05-27 10:11:53 UTC
Hi,

What did the trick with my Haswell Celeron 2955U:

In /etc/default/grub, I added kernel parameters I found in this thread to the GRUB_CMDLINE_LINUX_DEFAULT="" line, so that the whole line looks like this:

GRUB_CMDLINE_LINUX_DEFAULT="drm.debug=0 drm.vblankoffdelay=1 i915.semaphores=0"

For 10 days now there have been no GPU hangs on my Ubuntu system with the Celeron 2955U.
Thanks to all.
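
A small sketch, not part of the original comment, for confirming that a parameter added to GRUB_CMDLINE_LINUX_DEFAULT actually reached the running kernel. Edits to /etc/default/grub only take effect once the GRUB configuration has been regenerated (`update-grub` on Ubuntu) and the machine rebooted. The function name cmdline_has and its optional file argument are assumptions for illustration; by default it reads /proc/cmdline.

```shell
#!/bin/sh
# Succeeds if PARAM ($1) appears as a whole word on the kernel command line.
# FILE ($2) defaults to /proc/cmdline; another file can be given for testing.
cmdline_has() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put each parameter on its own line, then require an exact, literal match.
    tr ' ' '\n' < "$file" | grep -qxF -- "$param"
}
```

For example, `cmdline_has i915.semaphores=0 && echo active` prints "active" only when that exact parameter was passed at boot. This shows what the kernel was booted with, not whether a given parameter has any bearing on the hang.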
Comment 175 Chris Wilson 2015-05-28 21:17:24 UTC
*** Bug 90729 has been marked as a duplicate of this bug. ***
Comment 176 Chris Wilson 2015-06-19 08:29:06 UTC
*** Bug 91024 has been marked as a duplicate of this bug. ***
Comment 177 Chris Wilson 2015-06-29 13:49:47 UTC
*** Bug 91144 has been marked as a duplicate of this bug. ***
Comment 178 Dave 2015-08-06 15:30:48 UTC
I tried the following as mentioned by Winni:

GRUB_CMDLINE_LINUX_DEFAULT="drm.debug=0 drm.vblankoffdelay=1 i915.semaphores=0"

The screen timed out as usual (still able to move the mouse cursor around, resize windows, see the cursor change, etc.), but this time it never came back.  I even tried "ps aux | grep compiz" and "kill -9 ####" on compiz and compiz-decorator, which usually causes it to reload, but I ended up having to restart the computer as I didn't know what else to try.

I'm reopening the bug for this reason.  If you feel I shouldn't, then perhaps I should reopen my original here: https://bugs.freedesktop.org/show_bug.cgi?id=90659 - let me know.  If there's anything I can do to assist in finding a solution, don't hesitate to ask!

Thanks,
Dave
Comment 179 Chris Wilson 2015-08-06 15:38:38 UTC
(In reply to Dave from comment #178)
> I tried the following as mentioned by Winni:
> 
> GRUB_CMDLINE_LINUX_DEFAULT="drm.debug=0 drm.vblankoffdelay=1
> i915.semaphores=0"

This bug has nothing to do with semaphores. Just update your kernel.
Comment 180 Chris Wilson 2015-09-09 09:05:48 UTC
*** Bug 91932 has been marked as a duplicate of this bug. ***
Comment 181 Chris Wilson 2015-09-10 08:20:19 UTC
*** Bug 91955 has been marked as a duplicate of this bug. ***
Comment 182 Chris Wilson 2015-10-23 17:23:00 UTC
*** Bug 92647 has been marked as a duplicate of this bug. ***
Comment 183 Chris Wilson 2015-11-01 09:30:01 UTC
*** Bug 92763 has been marked as a duplicate of this bug. ***
Comment 184 Chris Wilson 2016-01-18 20:34:02 UTC
*** Bug 93756 has been marked as a duplicate of this bug. ***
Comment 185 Chris Wilson 2016-04-23 13:50:40 UTC
*** Bug 95084 has been marked as a duplicate of this bug. ***

