Summary: | [HSW gt1] GPU HANG: ecode 0:0x87d3bffa on ctx load | ||
---|---|---|---|
Product: | DRI | Reporter: | Simon Farnsworth <simon> |
Component: | DRM/Intel | Assignee: | Chris Wilson <chris> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | blocker | ||
Priority: | highest | CC: | absolute, andy.pickin, anomaly256, bugs-freedesktop, byoungchan.lee.public, cdl85281, daaaans, daniel, davies.t.o, djsorinel, dslunjski, dvereb, fan4326, fernetmenta, freedesktop.org, fritsch, ftoth, gary.c.wang, hazinct, hgondalf, hugegreenbug, intel-gfx-bugs, jagduley, javi, jim, jonathan.cox.c, kramaphone, lists.jjorge, luming.yu, mactalla.obair, martin.x.andersen, maximsch2, maxwei, mokey_fraggle, morebikethanman, myfoolishgames, nbelavic, nemesis, node1011, n.schnelle, ola.redell, pgh.nunes, r4p5w7, rain.opik, rdieter, redhootcp, tbl0605, thomas, tournieral, tsal, windose |
Version: | XOrg git | ||
Hardware: | All | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | ALL | i915 features: | |
Attachments: |
Description
Simon Farnsworth
2014-09-09 15:18:08 UTC
Created attachment 105992
Error state collected during hang
I can repro this reliably, with only X11 and the compositor accessing the GPU; the application drawing (Adobe Flash in the repro case) is using X11 to draw.

Moving to xorg-x11-drv-intel-2.99.916-2.fc21 with SNA instead of UXA didn't help, nor did enabling triple buffering.

The switch into the compositor's context is where it dies (and in all the similar bugs it is the switch into the GL context). One hypothesis is that something in the context state saved from GL is corrupt or plain invalid upon restore.

An alternative shotgun:
http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=requests

(In reply to comment #3)
> The switch into the compositor's context is where it dies (and in all
> the similar bugs it is the switch into the GL context). One hypothesis is
> that something in the context state saved from GL is corrupt or plain
> invalid upon restore. An alternative shotgun:
> http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=requests

Applying the shotgun fixed it - at least one of the pellets must have hit the bug between the eyes.

How do you want me to proceed from here? I need a patchset that meets the rules for stable kernels (I'm trying to stick as closely as possible to Fedora's kernels here). I've got the usual tools to hand (repro case, git, RPM building tools etc), so can work with you to get a suitable tested patchset.

That's a bit scary then. Which commit did I point you at so I can find the parent drm-intel-nightly commit (to narrow the shotgun down a bit)?

I built and tested:

: sfarnsworth host64 $ git show
commit 3a5e1e6176fb61735a98f16a80c756b3cc69f125
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Aug 24 19:34:16 2014 +0100

    drm/i915: Convert a couple more INTEL_INFO-esque macros to be pointer agnostic

    Just a couple more macros that assume that they were being passed a struct drm_device when they want a struct drm_i915_private.
    Use our magic macro to ease transitioning over to using drm_i915_privates

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

diff --git a/drivers/gpu/drm/i915/i915_drv.h b/drivers/gpu/drm/i915/i915_drv.h
index 5cadfa5..d1678e2 100644
--- a/drivers/gpu/drm/i915/i915_drv.h
+++ b/drivers/gpu/drm/i915/i915_drv.h
@@ -2046,7 +2046,7 @@ struct drm_i915_cmd_table {
 #define HAS_VEBOX(dev) (INTEL_INFO(dev)->ring_mask & VEBOX_RING)
 #define HAS_LLC(dev) (INTEL_INFO(dev)->has_llc)
 #define HAS_WT(dev) ((IS_HASWELL(dev) || IS_BROADWELL(dev)) && \
-                     to_i915(dev)->ellc_size)
+                     __I915__(dev)->ellc_size)
 #define I915_NEED_GFX_HWS(dev) (INTEL_INFO(dev)->need_gfx_hws)
 #define HAS_HW_CONTEXTS(dev) (INTEL_INFO(dev)->gen >= 5)
@@ -2100,7 +2100,7 @@ struct drm_i915_cmd_table {
 #define INTEL_PCH_LPT_DEVICE_ID_TYPE 0x8c00
 #define INTEL_PCH_LPT_LP_DEVICE_ID_TYPE 0x9c00
-#define INTEL_PCH_TYPE(dev) (to_i915(dev)->pch_type)
+#define INTEL_PCH_TYPE(dev) (__I915__(dev)->pch_type)
 #define HAS_PCH_LPT(dev) (INTEL_PCH_TYPE(dev) == PCH_LPT)
 #define HAS_PCH_CPT(dev) (INTEL_PCH_TYPE(dev) == PCH_CPT)
 #define HAS_PCH_IBX(dev) (INTEL_PCH_TYPE(dev) == PCH_IBX)

The first commit to check is then 257d90d13794c2eb545ab0d6c708f21e2a0378b6. That will tell us if the fix is in my shotgun branch or upstream. My guess is that it is in this branch, in which case you have two points from which to start bisecting. I have a few guesses, it might well be one of the minor patches...

(In reply to comment #7)
> The first commit to check is then 257d90d13794c2eb545ab0d6c708f21e2a0378b6.
> That will tell us if the fix is in my shotgun branch or upstream. My guess
> is that it is in this branch, in which case you have two points from which
> to start bisecting. I have a few guesses, it might well be one of the minor
> patches...

That commit did not work - I get my GPU hangs. I'll start bisecting.

I think I done wrong.
It looks like I tried your *master* branch, not your *requests* branch, and bisect won't work:

: sfarnsworth host64 $ git bisect start
: sfarnsworth host64 $ git bisect good 3a5e1e6176fb61735a98f16a80c756b3cc69f125
: sfarnsworth host64 $ git bisect bad 257d90d13794c2eb545ab0d6c708f21e2a0378b6
Some good revs are not ancestor of the bad rev.
git bisect cannot work properly in this case.
Maybe you mistake good and bad revs?

(In reply to comment #9)
> I think I done wrong. It looks like I tried your *master* branch, not your
> *requests* branch, and bisect won't work:
>
> : sfarnsworth host64 $ git bisect start
>
> : sfarnsworth host64 $ git bisect good
> 3a5e1e6176fb61735a98f16a80c756b3cc69f125
>
> : sfarnsworth host64 $ git bisect bad
> 257d90d13794c2eb545ab0d6c708f21e2a0378b6
> Some good revs are not ancestor of the bad rev.
> git bisect cannot work properly in this case.
> Maybe you mistake good and bad revs?

It's just that git is very ethical and doesn't have the loose definition of good and bad that we do. In its opinion old code is always good and bugs are only ever introduced. To get around this you have to do a "reverse git bisect" and declare good as bad and vice versa, i.e.

git bisect start
git bisect good 257d90d13794c2eb545ab0d6c708f21e2a0378b6
git bisect bad 3a5e1e6176fb61735a98f16a80c756b3cc69f125

then hang -> git bisect good, working -> git bisect bad.

I wish git bisect had a switch for that so that you didn't have to run the risk of mixing up good/bad on each step.

I used "git bisect good" for GPU hangs, "git bisect bad" for "it works", and "git bisect skip" for "compiler says no kernel for you."
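The inverted labelling used for this bisect is easy to get wrong by hand, so it may help to write it down once. The following is only a sketch: the names `Outcome`, `reversed_bisect_verdict` and `bisect_run_exit_code` are hypothetical and nobody in this thread used such a script. The exit codes are the ones `git bisect run` documents (0 for good, 125 for skip, 1 to 127 other than 125 for bad).

```python
from enum import Enum


class Outcome(Enum):
    """What was observed after booting the kernel built from one commit."""
    HANG = "hang"                  # GPU hang reproduced
    WORKS = "works"                # no hang during the test run
    BUILD_FAILED = "build_failed"  # kernel did not compile


# Reversed bisect: we are hunting the commit that FIXED the hang, so the
# labels are swapped on purpose. A hang is reported as "good" (old
# behaviour) and a working kernel as "bad" (new behaviour).
REVERSED_VERDICT = {
    Outcome.HANG: "good",
    Outcome.WORKS: "bad",
    Outcome.BUILD_FAILED: "skip",
}

# Exit codes understood by `git bisect run`.
EXIT_CODE = {"good": 0, "bad": 1, "skip": 125}


def reversed_bisect_verdict(outcome: Outcome) -> str:
    """Return the `git bisect` subcommand to use for this observation."""
    return REVERSED_VERDICT[outcome]


def bisect_run_exit_code(outcome: Outcome) -> int:
    """Return the exit code a `git bisect run` script would use."""
    return EXIT_CODE[reversed_bisect_verdict(outcome)]


if __name__ == "__main__":
    for outcome in Outcome:
        print(outcome.value, "->", "git bisect", reversed_bisect_verdict(outcome))
```

Later Git releases (2.7 and newer) accept `--term-old` and `--term-new` on `git bisect start`, which removes the need to swap the labels mentally.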
Assuming no mistakes, I get:

: sfarnsworth host64 $ git bisect log
git bisect start
# good: [257d90d13794c2eb545ab0d6c708f21e2a0378b6] drm-intel-nightly: 2014y-08m-21d-10h-03m-09s integration manifest
git bisect good 257d90d13794c2eb545ab0d6c708f21e2a0378b6
# bad: [3a5e1e6176fb61735a98f16a80c756b3cc69f125] drm/i915: Convert a couple more INTEL_INFO-esque macros to be pointer agnostic
git bisect bad 3a5e1e6176fb61735a98f16a80c756b3cc69f125
# skip: [9abf49b9962e1fe5d30ac1cf32e8cc2272d531c4] intel-gtt: Report stolen_size as 0 when local memory is present
git bisect skip 9abf49b9962e1fe5d30ac1cf32e8cc2272d531c4
# skip: [30b824d88baa6b1a23e189c3c06ecf32e8cf0cbf] drm/i915: Reduce number of register access during IVB+ interrupt handling
git bisect skip 30b824d88baa6b1a23e189c3c06ecf32e8cf0cbf
# skip: [d33c3d9e218a8c96e6a15cc4b558b2b7780fe134] drm/i915: Check the minimum pitch for the user framebuffer
git bisect skip d33c3d9e218a8c96e6a15cc4b558b2b7780fe134
# skip: [dfd9d929b9a66d5ed9bfffc0335fc11293451290] drm/i915/sdvo: Fix LVDS connector status detection
git bisect skip dfd9d929b9a66d5ed9bfffc0335fc11293451290
# bad: [20ae302941850d0b3e00f6cbdc88d2824585f112] drm/i915: Improved w/a for rps on Baytrail
git bisect bad 20ae302941850d0b3e00f6cbdc88d2824585f112
# good: [555633d6527465a77845a9d705cd2075ccbdeef0] drm/i915: Remove DRI1 ring accessors and API
git bisect good 555633d6527465a77845a9d705cd2075ccbdeef0
# bad: [01094f706a41793d8708592e1925960370f83e05] drm/i915: Decouple the stuck pageflip on modeset
git bisect bad 01094f706a41793d8708592e1925960370f83e05
# bad: [03e2e353953fdd6627a0864be0e3c223762bd85c] drm/i915: Prevent recursive deadlock on releasing a busy userptr
git bisect bad 03e2e353953fdd6627a0864be0e3c223762bd85c
# skip: [6196a504b501a7e3ed6e740913243c2d2d070c21] drm/i915: Renames variables and functions that act upon intel_engine_cs
git bisect skip 6196a504b501a7e3ed6e740913243c2d2d070c21
# bad: [6fd4781d6c60795ab43180cdc081532054214fe7] drm/i915: s/seqno/request/ tracking inside objects
git bisect bad 6fd4781d6c60795ab43180cdc081532054214fe7
# only skipped commits left to test
# possible first bad commit: [6fd4781d6c60795ab43180cdc081532054214fe7] drm/i915: s/seqno/request/ tracking inside objects
# possible first bad commit: [6196a504b501a7e3ed6e740913243c2d2d070c21] drm/i915: Renames variables and functions that act upon intel_engine_cs

That indicates the shotgun helps. :| Oh well.

I've updated the shotgun at #requests. It's been reworked quite a bit since then, and I need to double check that it still applies. I think I have a germ of a theory as to what is going wrong.

(In reply to comment #13)
> I've updated the shotgun at #requests. It's been reworked quite a bit since
> then, and I need to double check that it still applies. I think I have a
> germ of a theory as to what is going wrong.

I'm now testing the requests branch, as of

commit da0c726483f60d4f53de49a4a2753a1d95983bd9
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Sep 18 12:54:55 2014 +0100

    Revert "drm/i915: Enable full PPGTT on gen7"

    This reverts commit 83255c23abe91da047dc71e52be62c42dd4c04a1.

This gets me a new bit of excitement - when I start X for the first time, the log file says:

[ 72.039] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
[ 72.039] (WW) xf86OpenConsole: setsid failed: Operation not permitted
[ 72.039] (EE)
Fatal server error:
[ 72.039] (EE) xf86OpenConsole: VT_ACTIVATE failed: Input/output error
[ 72.039] (EE)
[ 72.039] (EE)

which it didn't before. Second attempt to start X works fine.

I also get a new message (but not reliably) in dmesg:

[ 255.476493] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... blitter ring idle

And I've had the machine freeze completely while restarting X11.
Created attachment 106501
Make the context switch+dispatch uninterruptible

This should test my theory that it is a signal between setting the context and executing the batch that is causing the error. Slightly too coarse, but it should indicate whether I am heading in the right direction.

(Still would like confirmation on the current #requests shotgun :)

(In reply to comment #14)
> (In reply to comment #13)
> > I've updated the shotgun at #requests. It's been reworked quite a bit since
> > then, and I need to double check that it still applies. I think I have a
> > germ of a theory as to what is going wrong.
>
> I'm now testing the requests branch, as of
>
> commit da0c726483f60d4f53de49a4a2753a1d95983bd9
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Thu Sep 18 12:54:55 2014 +0100
>
> Revert "drm/i915: Enable full PPGTT on gen7"
>
> This reverts commit 83255c23abe91da047dc71e52be62c42dd4c04a1.
>
> This gets me a new bit of excitement - when I start X for the first time,
> the log file says:
>
> [ 72.039] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
> [ 72.039] (WW) xf86OpenConsole: setsid failed: Operation not permitted
> [ 72.039] (EE)
> Fatal server error:
> [ 72.039] (EE) xf86OpenConsole: VT_ACTIVATE failed: Input/output error
> [ 72.039] (EE)
> [ 72.039] (EE)
>
> which it didn't before. Second attempt to start X works fine.

Right, there is a nasty bug in the vt layer somewhere. The branch contains a patch to return -EIO to prevent a lockup. But you should see that regardless, just depends on kernel config.

> I also get a new message (but not reliably) in dmesg:
>
> [ 255.476493] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer
> elapsed... blitter ring idle
>
> And I've had the machine freeze completely while restarting X11.

These two are more concerning. On HSW. Hmm.

FWIW, those are both reproducible. I get the hangcheck message first, then later an X11 restart will take out the system (no local or remote access works).
(In reply to comment #15)
> Created attachment 106501
> Make the context switch+dispatch uninterruptible
>
> This should test my theory that it is a signal between setting the context and
> executing the batch that is causing the error. Slightly too coarse, but it
> should indicate whether I am heading in the right direction.
>
> (Still would like confirmation on the current #requests shotgun :)

A base 3.17-rc5 with this patch applied has GPU hangs. I've grabbed the error state if it would be interesting.

(In reply to comment #18)
> (In reply to comment #15)
> > Created attachment 106501
> > Make the context switch+dispatch uninterruptible
> A base 3.17-rc5 with this patch applied has GPU hangs. I've grabbed the
> error state if it would be interesting.

Please do, I expect it to be the same error, but we should check anyway.

Created attachment 106508
The error state after applying the patch from comment #15

(In reply to comment #20)
> Created attachment 106508
> The error state after applying the patch from comment #15

For the record, it is the same bug.

If you have time, could you check out c5cddc3c051057c11ea739744ed03d284ce0d0f3^ and see if that starts up ok? (Also if you have a netconsole for grabbing the oops from that lockup that would be very useful.)
Maybe:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ad55b06a3cb1..9509f04c57b6 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1351,10 +1351,8 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
     mutex_unlock(&dev->struct_mutex);
     ret = __wait_seqno(ring, seqno, reset_counter, true, NULL, file_priv);
     mutex_lock(&dev->struct_mutex);
-    if (ret)
-        return ret;
-    return i915_gem_object_wait_rendering__tail(obj, ring);
+    return 0;
 }
 /**

Or rather:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index ad55b06a3cb1..97089c392094 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1351,10 +1351,8 @@ i915_gem_object_wait_rendering__nonblocking(struct drm_i915_gem_object *obj,
     mutex_unlock(&dev->struct_mutex);
     ret = __wait_seqno(ring, seqno, reset_counter, true, NULL, file_priv);
     mutex_lock(&dev->struct_mutex);
-    if (ret)
-        return ret;
-    return i915_gem_object_wait_rendering__tail(obj, ring);
+    return ret;
 }
 /**

(In reply to comment #22)
> If you have time, could you check out
> c5cddc3c051057c11ea739744ed03d284ce0d0f3^ and see if that starts up ok?
> (Also if you have a netconsole for grabbing the oops from that lockup that
> would be very useful.)

c5cddc3c051057c11ea739744ed03d284ce0d0f3^ starts up. netconsole gives me:

[ 271.900969] ------------[ cut here ]------------
[ 271.901001] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:130!
[ 271.901026] invalid opcode: 0000 [#1] SMP
[ 271.901048] Modules linked in: netconsole dummy nf_conntrack_ipv4 ip6t_REJECT nf_defrag_ipv4 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack cfg80211 nf_conntrack ip6table_filter ip6_tables rfkill snd_dummy x86_pkg_temp_thermal coretemp snd_hda_codec_realtek kvm_intel snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel snd_hda_controller snd_hda_codec kvm snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer crct10dif_pclmul crc32_pclmul iTCO_wdt mei_me mei crc32c_intel iTCO_vendor_support snd ghash_clmulni_intel mxm_wmi tpm_tis lpc_ich tpm r8169 serio_raw pcspkr mii mfd_core i2c_i801 microcode soundcore wmi shpchp i915 i2c_algo_bit drm_kms_helper drm video
[ 271.901506] CPU: 0 PID: 1602 Comm: screen_manager Not tainted 3.17.0-rc5+ #9
[ 271.901533] Hardware name: ONELAN MS-7851/B85I (MS-7851), BIOS V3.5 05/30/2014
[ 271.901560] task: ffff8800d43d4a00 ti: ffff8800a02f4000 task.ti: ffff8800a02f4000
[ 271.901588] RIP: 0010:[<ffffffffa00a166d>] [<ffffffffa00a166d>] i915_gem_object_retire__read+0x16d/0x170 [i915]
[ 271.901650] RSP: 0018:ffff8800a02f7c78 EFLAGS: 00010246
[ 271.901671] RAX: ffff8800d475e900 RBX: ffff8800362e18e0 RCX: dead000000200200
[ 271.901698] RDX: 0000000000000140 RSI: ffff8800362e18e0 RDI: ffff8800d2fa26c0
[ 271.901724] RBP: ffff8800a02f7ca0 R08: ffff8800a02594f8 R09: ffff88011da173c0
[ 271.901750] R10: ffffea000351d780 R11: ffffffffa00b19d8 R12: ffff8800362e1a90
[ 271.901777] R13: 0000000000000001 R14: ffff8800362e0000 R15: ffff8800d2fa26c0
[ 271.901803] FS: 00007fe6bebc0700(0000) GS:ffff88011da00000(0000) knlGS:0000000000000000
[ 271.901833] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 271.901855] CR2: 00007fe6b8097138 CR3: 00000000ace05000 CR4: 00000000000407f0
[ 271.901881] Stack:
[ 271.901892] ffff8800362e18e0 ffff8800362e1a90 0000000000000001 ffff8800362e0000
[ 271.901926] ffff8800d50e4078 ffff8800a02f7cc0 ffffffffa00a1ea8 ffff8800362e18e0
[ 271.901960] 0000000000000005 ffff8800a02f7cf0 ffffffffa00a1f99 0000000000000000
[ 271.901994] Call Trace:
[ 271.902091] [<ffffffffa00a1ea8>] i915_gem_retire_requests__engine+0x58/0x110 [i915]
[ 271.902133] [<ffffffffa00a1f99>] i915_gem_retire_requests+0x39/0x90 [i915]
[ 271.902172] [<ffffffffa00a209d>] i915_gem_object_retire+0xad/0x220 [i915]
[ 271.902212] [<ffffffffa00a2241>] i915_gem_object_wait_rendering.part.36+0x31/0x70 [i915]
[ 271.902253] [<ffffffffa00a3574>] i915_gem_object_set_to_cpu_domain+0x84/0x1d0 [i915]
[ 271.902293] [<ffffffffa00a39a5>] i915_gem_set_domain_ioctl+0x115/0x140 [i915]
[ 271.902328] [<ffffffffa00139ac>] drm_ioctl+0x1ec/0x660 [drm]
[ 271.902354] [<ffffffff8120aff0>] do_vfs_ioctl+0x2e0/0x4a0
[ 271.902376] [<ffffffff8120b231>] SyS_ioctl+0x81/0xa0
[ 271.902399] [<ffffffff81722129>] system_call_fastpath+0x16/0x1b
[ 271.902422] Code: ff e8 88 0c f7 ff 5b 41 5c 41 5d 41 5e 41 5f 5d c3 4c 89 ff e8 35 fc ff ff e9 30 ff ff ff 4c 89 ff e8 68 fb ff ff e9 39 ff ff ff <0f> 0b 90 0f 1f 44 00 00 55 48 89 e5 53 48 8b 47 28 48 89 fb 48
[ 271.902941] RIP [<ffffffffa00a166d>] i915_gem_object_retire__read+0x16d/0x170 [i915]
[ 271.902985] RSP <ffff8800a02f7c78>
[ 271.939684] ---[ end trace 5bc289903bbf7885 ]---
[ 271.939688] Kernel panic - not syncing: Fatal exception
[ 271.939715] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[ 271.939754] drm_kms_helper: panic occurred, switching back to text console

Created attachment 106519
Error state after patch from comment #24 is applied

I applied the patch from #24 to Linus's tree, and got GPU hangs (see attached error). Wrong tree?

I'm not going to be able to do more tonight - 2 year old is interested in what I'm doing.

(In reply to comment #26)
> Created attachment 106519
> Error state after patch from comment #24 is applied
>
> I applied the patch from #24 to Linus's tree, and got GPU hangs (see
> attached error). Wrong tree?

That's fine. It was just a stab in the dark.
As for the BUG() the assert looks valid, but I haven't seen how it could end up there. Oh well.

Could you get drm.debug=7 dmesg for the BUG()? I don't think it will give anything else, but maybe it will have a nugget of gold in there. Best would be slub debug=y (use-after-free checks) or even kmemcheck.

Still scratching my head over that BUG(). I've splattered a few more into #requests, if you could be so kind as to see if that changes the oops.

Meanwhile, current theory is that maybe it is the CS programming around the ctx switch that is the significant change in the shotgun. Still thinking.

(In reply to comment #28)
> Could you get drm.debug=7 dmesg for the BUG()? I don't think it will give anything
> else, but maybe it will have a nugget of gold in there. Best would be slub
> debug=y (use-after-free checks) or even kmemcheck.

I've turned on slub debug, but drm.debug=7 dmesg flows too fast to send over netconsole, with lots of repeats of:

[ 164.320029] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[ 164.320055] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_BUSY
[ 164.320056] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_MADVISE
[ 164.320060] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_MADVISE
[ 164.320062] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_PWRITE
[ 164.320064] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_PWRITE
[ 164.320066] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_BUSY
[ 164.320067] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_MADVISE
[ 164.320069] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_PWRITE
[ 164.320071] [drm:drm_ioctl] pid=888, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2

If I hit the BUG() again, I'll give you whatever I can get.

(In reply to comment #29)
> Still scratching my head over that BUG(). I've splattered a few more into
> #requests, if you could be so kind as to see if that changes the oops.
>
> Meanwhile, current theory is that maybe it is the CS programming around the
> ctx switch that is the significant change in the shotgun. Still thinking.

I'm now testing your new #requests branch, as of

commit cdd8594d0f84e06f99cdd1e5b823b844c4249f6b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Sep 18 14:27:36 2014 +0100

    Revert "drm/i915: Enable full PPGTT on gen7"

    This reverts commit 83255c23abe91da047dc71e52be62c42dd4c04a1.

I'll let you know the results on Monday.

(In reply to comment #31)
> (In reply to comment #29)
> > Still scratching my head over that BUG(). I've splattered a few more into
> > #requests, if you could be so kind as to see if that changes the oops.
> >
> > Meanwhile, current theory is that maybe it is the CS programming around the
> > ctx switch that is the significant change in the shotgun. Still thinking.
>
> I'm now testing your new #requests branch, as of
>
> commit cdd8594d0f84e06f99cdd1e5b823b844c4249f6b
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Thu Sep 18 14:27:36 2014 +0100
>
> Revert "drm/i915: Enable full PPGTT on gen7"
>
> This reverts commit 83255c23abe91da047dc71e52be62c42dd4c04a1.
>
> I'll let you know the results on Monday.

This commit, running with i915.enable_rc6=0 i915.enable_fbc=0 slub_debug drm.debug=7, has not failed on me.
[229832.518684] [drm:drm_calc_vbltimestamp_from_scanoutpos] crtc 0 : v 7 p(0,-41)@ 230045.960557 -> 230045.961197 [e 0 us, 0 rep]
[229832.518741] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_RMFB
[229832.518744] [drm:__drm_framebuffer_unreference] ffff8800d48fc820: FB ID: 0 (2)
[229832.518748] [drm:drm_framebuffer_unreference] ffff8800d48fc820: FB ID: 0 (1)
[229832.518771] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.518861] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.518932] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.518998] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.519016] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, DRM_IOCTL_GEM_OPEN
[229832.519031] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_GET_TILING
[229832.519036] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_SET_DOMAIN
[229832.519083] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_SW_FINISH
[229832.519087] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.519214] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, DRM_IOCTL_GEM_CLOSE
[229832.519229] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.519233] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.519234] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.519236] [drm:drm_ioctl] pid=2692, dev=0xe200, auth=1, I915_GEM_SET_DOMAIN
[229832.519302] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_GETCRTC
[229832.519336] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_WAIT_VBLANK
[229832.519340] [drm:drm_wait_vblank] waiting on vblank count 13778948, crtc 0
[229832.519342] [drm:drm_wait_vblank] returning 13778948 to client
[229832.519347] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_GETCRTC
[229832.519367] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_ADDFB
[229832.519384] [drm:drm_framebuffer_reference] ffff8800d48fc820: FB ID: 56 (1)
[229832.519387] [drm:drm_mode_addfb] [FB:56]
[229832.519390] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_GETCRTC
[229832.519405] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, DRM_IOCTL_MODE_PAGE_FLIP
[229832.519408] [drm:drm_framebuffer_reference] ffff8800d48fc820: FB ID: 56 (2)
[229832.519446] [drm:drm_framebuffer_unreference] ffff8800d48fcc30: FB ID: 57 (3)
[229832.519498] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.519535] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.527171] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527176] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.527184] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.527192] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527194] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_DOMAIN
[229832.527348] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527352] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.527433] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[229832.527482] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527483] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527495] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527497] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.527504] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.527508] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527510] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_DOMAIN
[229832.527767] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527770] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527772] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527773] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527775] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527779] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.527853] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[229832.527873] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527875] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527877] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527885] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527887] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527898] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527900] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527902] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.527904] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.527906] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.527909] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.527972] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[229832.528004] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.528006] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.528008] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.528012] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.528074] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.528161] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[229832.528218] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.528226] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.529757] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.529761] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.529770] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_SET_TILING
[229832.529794] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_PWRITE
[229832.529798] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[229832.529868] [drm:legacy_ringbuffer_submission] UXA submitting garbage DR4, fixing up
[229832.529922] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_MADVISE
[229832.529943] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_BUSY
[229832.529949] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.529997] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.530068] [drm:drm_ioctl] pid=2671, dev=0xe200, auth=1, I915_GEM_THROTTLE
[229832.535331] [drm:drm_calc_vbltimestamp_from_scanoutpos] crtc 0 : v 7 p(0,-41)@ 230045.977219 -> 230045.977860 [e 0 us, 0 rep]

is a single frame's worth of dmesg output. I'm going to remove drm.debug=7 and retest.

(In reply to comment #31)
> (In reply to comment #29)
> > Still scratching my head over that BUG(). I've splattered a few more into
> > #requests, if you could be so kind as to see if that changes the oops.
> >
> > Meanwhile, current theory is that maybe it is the CS programming around the
> > ctx switch that is the significant change in the shotgun. Still thinking.
>
> I'm now testing your new #requests branch, as of
>
> commit cdd8594d0f84e06f99cdd1e5b823b844c4249f6b
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Thu Sep 18 14:27:36 2014 +0100
>
> Revert "drm/i915: Enable full PPGTT on gen7"
>
> This reverts commit 83255c23abe91da047dc71e52be62c42dd4c04a1.
>
> I'll let you know the results on Monday.
Without drm.debug=7, this gives me the VT race that ends in X logging:

[ 79.069] (++) using VT number 1

[ 79.070] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
[ 79.070] (WW) xf86OpenConsole: setsid failed: Operation not permitted
[ 79.070] (EE)
Fatal server error:
[ 79.070] (EE) xf86OpenConsole: VT_ACTIVATE failed: Input/output error

(In reply to comment #33)
> Without drm.debug=7, this gives me the VT race that ends in X logging:
>
> [ 79.069] (++) using VT number 1
>
> [ 79.070] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
> [ 79.070] (WW) xf86OpenConsole: setsid failed: Operation not permitted
> [ 79.070] (EE)
> Fatal server error:
> [ 79.070] (EE) xf86OpenConsole: VT_ACTIVATE failed: Input/output error

Not my fault! It's a race entirely in the VT layer. It's been plaguing my machines for many months but I haven't deciphered enough of the VT code to understand what it is trying to do, let alone why it is failing.

If you try to start X again, it will work - it's purely a timing issue afaict.

Would be good to know if the machine runs stable without the drm.debug, and what happens without i915.enable_rc6=0.

(In reply to comment #34)
> (In reply to comment #33)
> > Without drm.debug=7, this gives me the VT race that ends in X logging:
> >
> > [ 79.069] (++) using VT number 1
> >
> > [ 79.070] (WW) xf86OpenConsole: setpgid failed: Operation not permitted
> > [ 79.070] (WW) xf86OpenConsole: setsid failed: Operation not permitted
> > [ 79.070] (EE)
> > Fatal server error:
> > [ 79.070] (EE) xf86OpenConsole: VT_ACTIVATE failed: Input/output error
>
> Not my fault! It's a race entirely in the VT layer. It's been plaguing my
> machines for many months but I haven't deciphered enough of the VT code to
> understand what it is trying to do, let alone why it is failing.
>
> If you try to start X again, it will work - it's purely a timing issue
> afaict.
> > Would be good to know if the machine runs stable without the drm.debug, and > what happens without i915.enable_rc6=0. Shiny. I run without drm.debug, and I'm stable. I remove i915.enable_rc6=0, and the GPU hangs happen again. Created attachment 106669 [details]
error state gzip with #requests
Hang with #requests, kernel cmd line has i915.enable_fbc=0:
[ 1060.729567] [drm] GPU HANG: ecode 0:0x87d3bffa, in screen_manager [3142], reason: Ring hung, action: reset
[ 1060.729572] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1060.729574] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1060.729576] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1060.729578] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1060.729580] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1072.717358] [drm] stuck on render ring
[ 1072.719064] [drm] GPU HANG: ecode 0:0x87d3bffa, in screen_manager [3142], reason: Ring hung, action: reset
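For anyone collecting the same data: the log above says the crash dump is saved to /sys/class/drm/card0/error. The following is a minimal sketch, assuming Python 3 and sufficient privileges to read that file, of how the error state could be copied into a gzip file for attaching here. The function name `save_error_state` and the default output file name are hypothetical, chosen only for this illustration.

```python
import gzip
import shutil
from pathlib import Path

# Path taken from the kernel message above; reading it usually requires root.
ERROR_STATE = Path("/sys/class/drm/card0/error")


def save_error_state(src=ERROR_STATE, dest="i915_error_state.gz"):
    """Copy the i915 error state to a gzip file and return the output path.

    `save_error_state` and the default `dest` are hypothetical names used
    only for this sketch.
    """
    src = Path(src)
    dest = Path(dest)
    # Stream the data in chunks rather than trusting the reported file size.
    with src.open("rb") as fin, gzip.open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Calling `save_error_state()` with no arguments right after a hang would write `i915_error_state.gz` to the current directory; passing another `src` makes it easy to try the function on an ordinary file first.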
(In reply to comment #35) > Shiny. I run without drm.debug, and I'm stable. I remove i915.enable_rc6=0, > and the GPU hangs happen again. (In reply to comment #36) > Hang with #requests, kernel cmd line has i915.enable_fbc=0: Just for the sake of my sanity, can you confirm the command line settings used that resulted in the hang? For the record that last hang wasn't with my requests branch. (I could be in for a beating.) (In reply to comment #37) > (In reply to comment #35) > > Shiny. I run without drm.debug, and I'm stable. I remove i915.enable_rc6=0, > > and the GPU hangs happen again. > > (In reply to comment #36) > > Hang with #requests, kernel cmd line has i915.enable_fbc=0: > > Just for the sake of my sanity, can you confirm the command line settings > used that resulted in the hang? # cat /proc/cmdline BOOT_IMAGE=/bzImage root=/dev/mapper/NTBgroup-System20 ro log_buf_len=16M rd.md=0 rd.dm=0 LANG=en_GB.UTF-8 SYSFONT=latarcyrheb-sun16 KEYTABLE=uk rd.luks=0 rd.lvm.lv=NTBgroup/System20 rd.lvm.lv=NTBgroup/Swap swapaccount=1 systemd.unit=signage.target net.ifnames=0 consoleblank=0 i915.enable_fbc=0 rhgb quiet (In reply to comment #38) > For the record that last hang wasn't with my requests branch. (I could be in > for a beating.) Ooops, sorry, yes. That's Linus's tree with the patch from comment #24 applied. I'm now testing #requests, to see if I can knock that over. So far, only thing I've had is: [ 1224.506836] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... blitter ring idle and I'm not seeing any consequences from that. (In reply to comment #38) > For the record that last hang wasn't with my requests branch. (I could be in > for a beating.) After an afternoon of beating on the device under test, I have no failures from the requests branch. [ 1224.506836] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... blitter ring idle [ 8303.940102] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... 
render ring idle are the only things logged by the kernel, and the kernel was able to recover from both of them. Thanks, the missed irq/seqno coherency is worrying enough, but at least it confirms that there is some magic in there that seems to prevent the ctx load error. Do you mind keeping that test system running until a bug shows itself? (In reply to comment #42) > Thanks, the missed irq/seqno coherency is worrying enough, but at least it > confirms that there is some magic in there that seems to prevent the ctx > load error. Do you mind keeping that test system running until a bug shows > itself? I can keep the test system running indefinitely, as long as I can get a stable-suitable patch ASAP to stop the GPU hang. (In reply to comment #42) > Thanks, the missed irq/seqno coherency is worrying enough, but at least it > confirms that there is some magic in there that seems to prevent the ctx > load error. Do you mind keeping that test system running until a bug shows > itself? Still no further issues. It looks like I can only provoke that message by restarting X and the compositor; would you like me to set that going in an endless loop and see if it BUG()s? (In reply to comment #44) > (In reply to comment #42) > > Thanks, the missed irq/seqno coherency is worrying enough, but at least it > > confirms that there is some magic in there that seems to prevent the ctx > > load error. Do you mind keeping that test system running until a bug shows > > itself? > > Still no further issues. > > It looks like I can only provoke that message by restarting X and the > compositor; would you like me to set that going in an endless loop and see > if it BUG()s? Nah, worked out the cause there. It is the ivb+ blt irq coherency bug, and a bad interaction of patches in my branch broke the w/a. I've been trying to think as to what other magic could be in s/seqno/requests/ that fixup the ctx hang. I think we have more or less explored the ctx specific parts of the patch. So now what? 
:|

(Just marking this bug for special interest, since we have a patch that seems to work, just not yet the right patch.)

Small, but it forces the invalidate after the ctx load:

diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
index 1a0611bb576b..9676bc729f13 100644
--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1082,11 +1082,11 @@ i915_gem_ringbuffer_submission(struct drm_device *dev, struct drm_file *file,
 		}
 	}
 
-	ret = i915_gem_execbuffer_move_to_gpu(ring, vmas);
+	ret = i915_switch_context(ring, ctx);
 	if (ret)
 		goto error;
 
-	ret = i915_switch_context(ring, ctx);
+	ret = i915_gem_execbuffer_move_to_gpu(ring, vmas);
 	if (ret)
 		goto error;

(In reply to Chris Wilson from comment #47)
> Small, but it forces the invalidate after the ctx load:
>
> diff --git a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> index 1a0611bb576b..9676bc729f13 100644
> --- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> +++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
> @@ -1082,11 +1082,11 @@ i915_gem_ringbuffer_submission(struct drm_device
> *dev, struct drm_file *file,
>  		}
>  	}
>
> -	ret = i915_gem_execbuffer_move_to_gpu(ring, vmas);
> +	ret = i915_switch_context(ring, ctx);
>  	if (ret)
>  		goto error;
>
> -	ret = i915_switch_context(ring, ctx);
> +	ret = i915_gem_execbuffer_move_to_gpu(ring, vmas);
>  	if (ret)
>  		goto error;

Applied the moral equivalent of that change to 3.16.3, and I see failure. I'll attach the error state.

Created attachment 107665 [details]
Error state with invalidate after context switch
(In reply to Simon Farnsworth from comment #49) > Created attachment 107665 [details] > Error state with invalidate after context switch Looks like same error. However, there is also massive corruption of the render ring. Either that or the error state capture is snafu. Would you be happy with a backport of mammoth patch if it proved to be stable? (In reply to Chris Wilson from comment #50) > (In reply to Simon Farnsworth from comment #49) > > Created attachment 107665 [details] > > Error state with invalidate after context switch > > Looks like same error. However, there is also massive corruption of the > render ring. Either that or the error state capture is snafu. Would you be > happy with a backport of mammoth patch if it proved to be stable? A 3,000 patch, 80 MB patchset would be fine if it were stable on HSW and IVB. *** Bug 80229 has been marked as a duplicate of this bug. *** Created attachment 108029 [details] [review] Backport of requests and PPGTT changes to 3.17.0 I've backported the changes from #requests to apply against the kernel RPM from http://koji.fedoraproject.org/koji/buildinfo?buildID=583526 This is a fairly intrusive backport - I've tried to take drivers/gpu/drm/i915 wholesale, then remove execlists/logical ring contexts rather than piecemeal bring things forwards. I'd appreciate it if someone could look over what I've done, and tell me if it makes sense. I tested Simon's patch applied to a Ubuntu 3.17.1 kernel on a 1820T system. The GPU hang did not show during a three hours test. In general it shows within the first 20 min after system start. Will do more tests. This patch seems to introduce new problems for me. I place fences into the render pipeline and after a glFlush I don't get all of them into the signaled state. This procedure worked without this patch and with other driver like NVidia and AMD so the issue most likely got introduced with this huge patch. 
(In reply to Rainer Hochecker from comment #55)
> This patch seems to introduce new problems for me. I place fences into the
> render pipeline and after a glFlush I don't get all of them into the
> signaled state. This procedure worked without this patch and with other
> driver like NVidia and AMD so the issue most likely got introduced with this
> huge patch.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 6bf2dcf67bf2..158abb4c322a 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2544,13 +2544,16 @@ i915_gem_idle_work_handler(struct work_struct *work)
 static int
 i915_gem_object_flush_active(struct drm_i915_gem_object *obj)
 {
-	int ret;
+	int ret, n;
 
 	if (!obj->active)
 		return 0;
 
-	if (obj->last_write.request) {
-		ret = i915_request_emit_breadcrumb(obj->last_write.request);
+	for (n = 0; n < I915_NUM_ENGINES; n++) {
+		if (obj->last_read[n].request == NULL)
+			continue;
+
+		ret = i915_request_emit_breadcrumb(obj->last_read[n].request);
 		if (ret)
 			return ret;

which I hope is overkill. If you have a snippet demonstrating the fences I can trace through mesa and see if there is a more precise flush we can do.

Not sure if this is related: currently vaapi lacks a good interop method with GL, so we use vaPutSurface and texture-from-pixmap. The entire render pipeline is: decode with vaapi, vaapi postprocessing (deinterlacing), vaPutSurface (render vaapi video surface into pixmap), map pixmap to texture, render texture, place fence. When the fence signals, we know that the video surface (and some other resources) are ready for reuse.

This places the fence: https://github.com/FernetMenta/xbmc/blob/master/xbmc/cores/dvdplayer/DVDCodecs/Video/VAAPI.cpp#L1202
This function checks for signaled fences: https://github.com/FernetMenta/xbmc/blob/master/xbmc/cores/dvdplayer/DVDCodecs/Video/VAAPI.cpp#L2022

I have had video playing for 4 hours today without any issues, but as soon as I stop playback it waits for all fences to be signaled, that is, for COutput::ProcessSyncPicture to return false, and this does not happen. There is at least one fence not in GL_SIGNALED state. When playback is stopped there is a glFlush before COutput::ProcessSyncPicture: https://github.com/FernetMenta/xbmc/blob/master/xbmc/cores/dvdplayer/DVDCodecs/Video/VAAPI.cpp#L1709 It never comes out of the while loop at the next line.

Peter has built a new kernel with the last patch. Will test this tomorrow and report back.

The patch in comment 56 fixes the issue.

Created attachment 108310 [details]
crashlog on Google Chromebox
The attached zip contains dmesg and the error state after the problem recurred with the latest Chris Wilson patches on a 3.17.1 kernel.
Hardware: Celeron 2955U.
Sadly, one of our users still gets frequent crashes with the latest code linked here, which is Simon's extracted patch plus the additional fix Chris Wilson made for the fence problem. Perhaps the dmesg helps a bit in that case, because it looks much more detailed than before.

Created attachment 108459 [details]
dmesg output after suspend/resume
I tried Simon's patch for kernel 3.17.0 and Chris Wilson's latest patch and I have not experienced the hangs yet. However, I noticed that resume from suspend no longer works. The screen flickers and then remains black. Attached is the dmesg output.
I have this exact bug too. My hardware is an (ex-)Chromebook Acer C720P with Haswell-ULT graphics. I'm running linux-3.17 with xorg 1.16 and intel 2.99.916. The desktop has kwin, which autodetects the hang and resumes as if nothing happened, except for a pause of 5 seconds or so. So no X restarts or other bad crashes. What I've noted is that the hang happens occasionally when running Firefox (1x per day), but often when running Chromium (1x per hour?). How can I help?

I just noticed that if I use this kernel option: i915.enable_ppgtt=0, I get a different hang:

8.562995] [drm] stuck on render ring
[ 498.564286] [drm] GPU HANG: ecode 0:0x85dffffd, in Xorg [1161], reason: Ring hung, action: reset
[ 498.564289] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 498.564290] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 498.564291] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 498.564293] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 498.564294] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 507.545019] [drm] stuck on render ring
[ 507.546315] [drm] GPU HANG: ecode 0:0x85dffffd, in Xorg [1161], reason: Ring hung, action: reset
[ 507.546921] [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!

It seems like this hang has been fixed here: https://www.libreoffice.org/bugzilla/show_bug.cgi?id=78533 , but when I compare the patch in that post with the kernel 3.17.1, there are things missing.
For example, I see that this doesn't fully match up: --- a/drivers/gpu/drm/i915/i915_drv.h +++ b/drivers/gpu/drm/i915/i915_drv.h @@ -2139,13 +2139,16 @@ void i915_init_vm(struct drm_i915_private *dev_priv, void i915_gem_free_object(struct drm_gem_object *obj); void i915_gem_vma_destroy(struct i915_vma *vma); -#define PIN_MAPPABLE 0x1 -#define PIN_NONBLOCK 0x2 -#define PIN_GLOBAL 0x4 +#define PIN_OFFSET_FIXED 0x1 +#define PIN_OFFSET_BIAS 0x2 +#define PIN_MAPPABLE 0x4 +#define PIN_NONBLOCK 0x8 +#define PIN_GLOBAL 0x10 +#define PIN_OFFSET_MASK (~4095) Is this something worth investigating, or am I wasting my time? I can confirm that patch posted by Simon with i915.enable_rc6=0 does not fix the issue. I also looked into my previous question and it didn't help. *** Bug 85765 has been marked as a duplicate of this bug. *** This comment was the result of test with kernel 3.17.1 with the patch submitted by Simon here. only so far.(In reply to Hugh Greenberg from comment #64) > I just noticed that if I use this kernel option: i915.enable_ppgtt=0, I get > a different hang: > > 8.562995] [drm] stuck on render ring > [ 498.564286] [drm] GPU HANG: ecode 0:0x85dffffd, in Xorg [1161], reason: > Ring hung, action: reset > [ 498.564289] [drm] GPU hangs can indicate a bug anywhere in the entire gfx > stack, including userspace. > [ 498.564290] [drm] Please file a _new_ bug report on bugs.freedesktop.org > against DRI -> DRM/Intel > [ 498.564291] [drm] drm/i915 developers can then reassign to the right > component if it's not a kernel issue. > [ 498.564293] [drm] The gpu crash dump is required to analyze gpu hangs, so > please always attach it. > [ 498.564294] [drm] GPU crash dump saved to /sys/class/drm/card0/error > [ 507.545019] [drm] stuck on render ring > [ 507.546315] [drm] GPU HANG: ecode 0:0x85dffffd, in Xorg [1161], reason: > Ring hung, action: reset > [ 507.546921] [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, > banning! 
> > It seems like this hang has been fixed here: > https://www.libreoffice.org/bugzilla/show_bug.cgi?id=78533 , but when I > compare the patch in that post with the kernel 3.17.1, there are things > missing. For example, I see that this doesn't fully match up: > > --- a/drivers/gpu/drm/i915/i915_drv.h > +++ b/drivers/gpu/drm/i915/i915_drv.h > @@ -2139,13 +2139,16 @@ void i915_init_vm(struct drm_i915_private *dev_priv, > void i915_gem_free_object(struct drm_gem_object *obj); > void i915_gem_vma_destroy(struct i915_vma *vma); > > -#define PIN_MAPPABLE 0x1 > -#define PIN_NONBLOCK 0x2 > -#define PIN_GLOBAL 0x4 > +#define PIN_OFFSET_FIXED 0x1 > +#define PIN_OFFSET_BIAS 0x2 > +#define PIN_MAPPABLE 0x4 > +#define PIN_NONBLOCK 0x8 > +#define PIN_GLOBAL 0x10 > +#define PIN_OFFSET_MASK (~4095) > > > Is this something worth investigating, or am I wasting my time? Chris Wilson gave me the following kernel parameter to try: i915.enable_ppgtt=0 on a stock kernel (not including the giant patch referenced here) and I have not been able to reproduce the hangs with it. I've tested with kernel 3.17.1 and 3.17.2. Please confirm or deny that this works for you. Thanks. (In reply to Hugh Greenberg from comment #68) > Chris Wilson gave me the following kernel parameter to try: > > i915.enable_ppgtt=0 > > on a stock kernel (not including the giant patch referenced here) and I have > not been able to reproduce the hangs with it. I've tested with kernel 3.17.1 > and 3.17.2. Please confirm or deny that this works for you. Thanks. I tried this 3 weeks ago and did not help: https://bugs.freedesktop.org/show_bug.cgi?id=80229#c62 (In reply to Rainer Hochecker from comment #69) > (In reply to Hugh Greenberg from comment #68) > > Chris Wilson gave me the following kernel parameter to try: > > > > i915.enable_ppgtt=0 > > > > on a stock kernel (not including the giant patch referenced here) and I have > > not been able to reproduce the hangs with it. I've tested with kernel 3.17.1 > > and 3.17.2. 
Please confirm or deny that this works for you. Thanks. > > I tried this 3 weeks ago and did not help: > https://bugs.freedesktop.org/show_bug.cgi?id=80229#c62 Yes, you are right. I just experienced the hang again. Sorry. Created attachment 108976 [details]
dump after error with i915.enable_ppgtt=0
This is the error dump I got after booting the kernel with: i915.enable_ppgtt=0 .
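Since several results in this thread depend on exactly which i915.* and drm.* options were on the command line, here is a minimal sketch, assuming Python 3, for listing them from the contents of /proc/cmdline. The helper name `gfx_params` is hypothetical, and the parsing is deliberately simple: it splits on whitespace and does not handle quoted values.

```python
def gfx_params(cmdline):
    """Return {name: value} for the i915.* and drm.* options in a kernel
    command line string (for example the contents of /proc/cmdline).

    `gfx_params` is a hypothetical helper name used only for this sketch.
    """
    params = {}
    for token in cmdline.split():
        # Only graphics-related module options are of interest here.
        if not token.startswith(("i915.", "drm.")):
            continue
        # Options without "=" end up with an empty string as their value.
        name, _, value = token.partition("=")
        params[name] = value
    return params
```

On a live system this could be used as `gfx_params(open("/proc/cmdline").read())`, and the resulting dictionary pasted alongside the error state so the boot options behind each hang are unambiguous.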
Am I correct in the summary that especially Celeron and Pentium HSW GPUs are affected? Our testing on the xbmc forums shows the same results. It seems only the simple HSW GPUs frequently run into this hang. Perhaps someone could have a look whether the "ringbuffer" is one bit off, or whether something happens when we get rounding, clamping, fragmentation, anything? Some flush missing, as in the past in Mesa? It might very well be that the higher GPU series also have that issue, but are not hit as frequently because they have more Execution Units? Perhaps more load can stall them too?

I have an Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 09) (Intel Celeron 2955U); I guess that matches the profile? This is on an Acer C720P Chromebook booting using the built-in SeaBIOS (developer mode). Strange thing: I notice hangs often when running Chromium, but never when booting ChromeOS (which runs Chrome). So either Google fixed something in their kernel, or they have the GPU configured differently when coreboot boots ChromeOS than when coreboot boots SeaBIOS. Or do kernels more recent than ChromeOS's introduce the problem?

Can you try the chrome kernel version on your linux installation? Ubuntu mainline has some of those (if you are using Ubuntu). That would be a good start for bisecting.

They have? I didn't know that. Which kernel would you like me to try? Up to now I had 3.13 from 14.04 with some patched-up drivers to get touchpad and touchscreen working, and 3.17rc?, 3.17 and 3.17.1 from the kernel PPA (mainline kernels). 3.16 from 14.10 does not have the drivers included and no patched drivers are available afaik, so it doesn't work too well. I also tried 3.18rc?, but that doesn't seem to be in the best shape right now (rc1 didn't even boot). 3.13 + patches and 3.17 seem to run equally well but with the exact same GPU hang. Also the xorg-edgers PPA makes no change.

Created attachment 109009 [details] [review]
possible hang fix
This is a small patch to change a register definition that I think is wrong.
It is against 3.17.2, but it should work for at least any 3.17 kernel. Please let me know if it fixes the issue or not. (In reply to Hugh Greenberg from comment #76) > Created attachment 109009 [details] [review] [review] > possible hang fix > > This is a small patch to change a register definition that I think is wrong. > It is against 3.17.2, but it should work for at least any 3.17 kernel. > Please let me know if it fixes the issue or not. No special boot parameters are needed. For easy testing I build Ubuntu kernel packages based on my gpuhang branch at: https://github.com/fritsch/linux/tree/gpuhang This is stable 3.17.2 with the patch Hugh Greenberg provided: https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.17.2-fix-gpu-hang%2B_3.17.2-fix-gpu-hang%2B-10.00.Custom_amd64.deb https://dl.dropboxusercontent.com/u/55728161/linux-image-3.17.2-fix-gpu-hang%2B_3.17.2-fix-gpu-hang%2B-10.00.Custom_amd64.deb Happy testing to those that run the affected hardware. Created attachment 109018 [details]
gpu hang error with Greenberg patch
Kernel hang with patch provided by Hugh Greenberg
Created attachment 109019 [details]
dmesg 3.17.2 + Greenberg patch
Nevertheless, I think you are onto something.

[ 869.806084] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 869.806084] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 869.806085] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 869.806086] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 974.865662] [drm] stuck on render ring
[ 974.869560] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 1248.988099] [drm] stuck on render ring
[ 1248.992108] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 1354.039617] [drm] stuck on render ring
[ 1354.043649] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 1472.097518] [drm] stuck on render ring
[ 1472.101540] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 2204.468693] [drm] stuck on render ring
[ 2204.472663] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 2430.567575] [drm] stuck on render ring
[ 2430.571278] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 3101.896809] [drm] stuck on render ring
[ 3101.900614] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 5090.884258] [drm] stuck on render ring
[ 5090.888231] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset
[ 5402.024853] [drm] stuck on render ring
[ 5402.028782] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: Ring hung, action: reset

It seems (no scientific proof) the hang occurs even more frequently with the patch applied. Every HANG visible there freezes the renderer for a certain amount of time, which has a huge visual impact.

Happy to test other ideas.
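To put a number on "more frequently", the hang timestamps can be pulled out of dmesg and averaged. This is a minimal sketch, assuming Python 3 and that the log uses the "GPU HANG: ecode ..." format shown in this thread; the names `HANG_RE`, `hang_times` and `mean_interval` are hypothetical and exist only for this illustration. It assumes all timestamps come from a single boot, so they increase monotonically.

```python
import re

# Matches kernel lines such as:
# [ 974.869560] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], reason: ...
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\]\s+GPU HANG: ecode (?P<ecode>\S+),"
    r" in (?P<proc>\S+) \[(?P<pid>\d+)\]"
)


def hang_times(dmesg_text):
    """Return the timestamps (seconds since boot) of every GPU HANG line."""
    return [float(m.group("ts")) for m in HANG_RE.finditer(dmesg_text)]


def mean_interval(times):
    """Return the mean gap in seconds between consecutive hangs,
    or None when fewer than two hangs were seen."""
    if len(times) < 2:
        return None
    return (times[-1] - times[0]) / (len(times) - 1)
```

Running `mean_interval(hang_times(text))` on the dmesg output of a patched and an unpatched kernel, under the same workload, would give two figures that can be compared directly instead of relying on impression.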
*** Bug 85503 has been marked as a duplicate of this bug. *** Thanks for trying it. The patch is wrong, sorry. I'll keep working on it. To reproduce this in Kodi, are there any settings that I need to enable? (In reply to Peter Frühberger from comment #81) > Never the less I think you are onto somethin. > > [ 869.806084] [drm] Please file a _new_ bug report on bugs.freedesktop.org > against DRI -> DRM/Intel > [ 869.806084] [drm] drm/i915 developers can then reassign to the right > component if it's not a kernel issue. > [ 869.806085] [drm] The gpu crash dump is required to analyze gpu hangs, so > please always attach it. > [ 869.806086] [drm] GPU crash dump saved to /sys/class/drm/card0/error > [ 974.865662] [drm] stuck on render ring > [ 974.869560] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], > reason: Ring hung, action: reset > [ 1248.988099] [drm] stuck on render ring > [ 1248.992108] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], > reason: Ring hung, action: reset > [ 1354.039617] [drm] stuck on render ring > [ 1354.043649] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], > reason: Ring hung, action: reset > [ 1472.097518] [drm] stuck on render ring > [ 1472.101540] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], > reason: Ring hung, action: reset > [ 2204.468693] [drm] stuck on render ring > [ 2204.472663] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], > reason: Ring hung, action: reset > [ 2430.567575] [drm] stuck on render ring > [ 2430.571278] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], > reason: Ring hung, action: reset > [ 3101.896809] [drm] stuck on render ring > [ 3101.900614] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], > reason: Ring hung, action: reset > [ 5090.884258] [drm] stuck on render ring > [ 5090.888231] [drm] GPU HANG: ecode 0:0x87d3bffa, in kodi.bin [858], > reason: Ring hung, action: reset > [ 5402.024853] [drm] stuck on render ring > [ 5402.028782] [drm] GPU HANG: ecode 
0:0x87d3bffa, in kodi.bin [858],
> reason: Ring hung, action: reset
>
> It seems (no scientific proof) the hang occurs even more frequently with the
> patch applied. Every HANG one can see there will freeze the render for a
> specific amount of time - which has a huge visual impact.
>
> Happy to test other ideas.

Quite easy: build latest master (Helix beta1) via https://github.com/xbmc/xbmc/commits/master or use a nightly PPA. Enable VAAPI (disable VDPAU). Under Video -> Acceleration, check that "Prefer VAAPI Render Method" is enabled, which is the default. You need to switch the settings hierarchy to "Expert" to see those settings.

You can readily provoke the error by watching interlaced content and using Motion Compensation Deinterlacing with VPP on top. Beware, there is no released vaapi driver that supports this yet; you need to use http://cgit.freedesktop.org/vaapi/intel-driver/log/ (e.g. the master branch). Gwenole repaired the driver (vebox fixes) some months ago and pushed the results into the libva master branch last week.

While playing such an interlaced video, press return, select the video film roll and activate Deinterlace: Auto, Deinterlace-Method: Motion Compensation Deinterlacing. Save for all files. Now wait <= 10 minutes and see it hang.

Btw. using VPP deinterlacing seems to stress the GPU more, so you don't need to keep it running for hours (as you would with progressive content). You need a Celeron HSW platform to reproduce. On my Core systems the issue almost never happens.

@Peter Frühberger Can you tell me the name of the package you refer to in comment #74?

There is no package. That was only a question. A short google revealed that the chrome guys seem to use something highly customized, so I don't think such a package exists.

I can't figure this out. I'm not an Intel or kernel developer, but maybe I could figure it out with hardware debugging support or docs that were correct.
I would recommend that we just stop purchasing Intel GPUs and go with Nvidia based GPUs. This bug report is one of many for the same bug. My first report was from June. I really don't think this is going to get fixed.

Yeah. You exactly make the right point. And sorry - I thought you were an intel dev last time :-). I even searched for you in the intel channel. Now I know why the other intel devs could not find you.

Thanks much for trying to help.

It really feels like being a 3rd party citizen. I am not sure what else we could do to solve that issue.

I will also ignore that bugreport from now on .. I have a PR ready to remove VAAPI from xbmc. I think this will be a good signal for protesting.

(In reply to Peter Frühberger from comment #88)
> Yeah. You exactly make the right point. And sorry - I thought you were an
> intel dev last time :-). I even searched for you in the intel channel. Now I
> know why the other intel devs could not find you.
>
> Thanks much for trying to help.
>
> It really feels like being a 3rd party citizen. I am not sure what else we
> could do to solve that issue.
>
>
> I will also ignore that bugreport from now on .. I have a PR ready to remove
> VAAPI from xbmc. I think this will be a good signal for protesting.

No problem. I should have made that clear. I have an Acer C720 and I've been making Linux distributions for it so that other Acer C720 owners who want Linux can easily install it without having to figure a ton of things out. This was the last major bug that I have encountered. I also started a site around those distros (distroshare.com) so there could be a single place for others to share such distributions.

I'm a big fan of Kodi/XBMC btw. Thanks for such awesome software. In case you weren't aware, this bug will actually affect any hardware acceleration path that XBMC takes, not just the VAAPI one. It just seems to show up more with VAAPI. Anything that uses the dri/drm layers will encounter this bug.
Maybe you can direct your users that encounter this bug to submit a bug report. (In reply to Peter Frühberger from comment #88) > Yeah. You exactly make the right point. And sorry - I thought you were an > intel dev last time :-). I even searched for you in the intel channel. Now I > know why the other intel devs could not find you. > > Thanks much for trying to help. > > It really feels like being a 3rd party citizen. I am not sure what else we > could do to solve that issue. > > > I will also ignore that bugreport from now on .. I have a PR ready to remove > VAAPI from xbmc. I think this will be a good signal for protesting. Created attachment 109165 [details]
GPU-hang dmesg output on Pentium G3420 using OpenELEC 4.2.1/XBMC
Hi,
I'm having the crashes described here quite often while running XBMC on OpenELEC 4.2.1 (latest stable):
[15194.281458] [drm] stuck on render ring
[15194.282221] [drm] GPU HANG: ecode 0:0x87d3bffa, in xbmc.bin [780], reason: Ring hung, action: reset
[15196.281552] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
Complete dmesg attached.
This causes the video to freeze every few seconds, while audio continues normally. After the hang the video 'catches up' by skipping a lot of frames until it's in sync again. I'm using x264 mpeg4 source material.
This is a very annoying bug for me and makes my new media system seem like a waste of money.
Please investigate this issue further!
Thanks,
M. Kramer
Hi,

I have a Haswell Pentium G3240 CPU, and adding the following kernel parameters to my bootloader seems to prevent the GPU hangs for me.

i915.semaphores=0 i915.use_mmio_flip=1 i915.enable_ppgtt=1 drm.vblankoffdelay=1

I don't know how to add these kernel parameters in OpenELEC, so you will need to ask on their forum how to do this.

Hope this helps.

Also note that this guy with a Haswell Chromebook uses only the following kernel parameters to prevent the GPU hangs: https://johnlewis.ie/tentative-fixwork-around-for-i915-gpu-hangs/ I think I'll try reducing it down to these options too.

drm.debug=0 drm.vblankoffdelay=1 i915.semaphores=0

I am running Arch Linux with kernel 3.17.2 and the following elements:
- libva-intel-driver 1.4.1
- libva 1.4.1
- xf86-video-intel 2.99.916
- xbmc 13.2

Hope this information helps.

We all tried the fix from John Lewis. It caused system hangs and we couldn't figure out which options actually helped. I'm trying your modified version of the command line and so far I've been able to use Kodi with VAAPI for a long time (2 hours, I think) without a hang or freeze. I will keep it going for the rest of the day, though, before I am sure that this really works.

(In reply to adr3nal1n from comment #92)
> Hi,
>
> I have a haswell pentium G3240 cpu and adding the following kernel
> parameters to my bootloader seem to prevent the gpu hangs for me.
>
> i915.semaphores=0 i915.use_mmio_flip=1 i915.enable_ppgtt=1
> drm.vblankoffdelay=1
>
> I don't know how to add these kernel parameters in openelec so you will need
> to ask on their forum how to do this.
>
> Hope this helps.

This bug report is where we were testing that fix: https://bugs.freedesktop.org/show_bug.cgi?id=80229. Comment 58 there (https://bugs.freedesktop.org/show_bug.cgi?id=80229#c58) used the same command line as you, except for the vblankoffdelay, and encountered system freezes.
(In reply to adr3nal1n from comment #93) > Also note that this guy with a haswell chromebook uses only the following > kernel parameters to prevent the gpu hangs > https://johnlewis.ie/tentative-fixwork-around-for-i915-gpu-hangs/ I think > i'll try reducing it down to these options too. > > drm.debug=0 drm.vblankoffdelay=1 i915.semaphores=0 > > I am running arch linux with kernel 3.17.2 and the following elements: > > - libva-intel-driver 1.4.1 > - libva 1.4.1 > - xf86-video-intel 2.99.916 > - xbmc 13.2 > > Hope this information helps. Hope your testing goes well Hugh, I have been using (i915.semaphores=0 i915.use_mmio_flip=1 i915.enable_ppgtt=1 drm.vblankoffdelay=1) with xbmc gotham for a couple of days now with no hangs. I'll post again if i notice any hangs over the coming days. (I normally use xbmc for a few hours a day) If someone could figure out how to dump the GPU instructions on a Windows machine with a Haswell chipset, I think we could develop a patch. Why the Intel developers couldn't do this is beyond me. (In reply to adr3nal1n from comment #96) > Hope your testing goes well Hugh, > > I have been using (i915.semaphores=0 i915.use_mmio_flip=1 > i915.enable_ppgtt=1 drm.vblankoffdelay=1) with xbmc gotham for a couple of > days now with no hangs. > > I'll post again if i notice any hangs over the coming days. (I normally use > xbmc for a few hours a day) Hi Hugh, Just wanted to let you know that XBMC dev fritsch stated the following regarding the use of the above kernel parameters. "We made longtime tests and the same happens after 12 hours or more. So it seems to make the bug "more unlikely", but if you add additional load on the GPU (as we do with v14 and the VPP Deinterlacers), you will get the hang again." I am this guy (https://bugs.freedesktop.org/show_bug.cgi?id=83677#c84) aka fritsch. You should return it if you can. This bug has been around for more than 6 months. It doesn't seem like it will be fixed any time soon. (In reply to M. 
Kramer from comment #91) > Created attachment 109165 [details] > GPU-hang dmesg output on Pentium G3420 using OpenELEC 4.2.1/XBMC > > Hi, > > I'm having the crashes described here quite often while running XBMC on > OpenELEC 4.2.1 (latest stable): > > [15194.281458] [drm] stuck on render ring > [15194.282221] [drm] GPU HANG: ecode 0:0x87d3bffa, in xbmc.bin [780], > reason: Ring hung, action: reset > [15196.281552] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off > > Complete dmesg attached. > > This causes the video to freeze every few seconds, while audio continues > normally. After the hang the video 'catches up' by skipping a lot of frames > until it's in sync again. I'm using x264 mpeg4 source material. > > This is a very annoying bug for me and seems to make my new media system > seem like a waste of money. > > Please investigate this issue further! > > Thanks, > M. Kramer I also suffer from this bug, an openelec bug can be seen here' http://sprunge.us/QBKV and the gpu crash log may be downloaded here; http://www.demenno.nl/error.txt thats almost 4MB. Intel please fix this asap! Created attachment 109331 [details]
1037U kernel BUG traceback in i915 code
The picture shows a kernel BUG that we can reproduce on IvyBridge CPUs.
This traceback is from a 1037U running the kernel patched by Simon Farnsworth with the advice of Chris Wilson.
We can reproduce but not at will.
We have test code that will provoke the bug given enough test runs.
use hw de-interlacers and you'll reproduce in max 11 minutes, every time. (see my logs). (In reply to Barry Scott from comment #102) > Created attachment 109331 [details] > 1037U kernel BUG traceback in i915 code > > The picture shows a kernel BUG that we can reproduce on IvyBridge CPUs. > This traceback is from a 1037U running the kernel patched by Simon > Farnsworth with the advice of Chris Wilson. > > We can reproduce but not at will. > We have test code that will provoke the bug given enough test runs. Would you mind sharing how you did that? Created attachment 109459 [details] [review] Force a CS stall inside gen7 invalidate-caches Here are 3.17.2 mainline kernel builds with the patch applied: https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.17.2wilsonv1_3.17.2wilsonv1-10.00.Custom_amd64.deb https://dl.dropboxusercontent.com/u/55728161/linux-image-3.17.2wilsonv1_3.17.2wilsonv1-10.00.Custom_amd64.deb I will be traveling until Sunday, so give those a nice test, please. I made a short test on my 1820T: http://paste.ubuntu.com/9008515/ - did not help, got the gpu hang after exactly 20 seconds. Here are new test kernels, chris wilson wants you to test: https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.18.0-rc5-icklemasterv1%2B_3.18.0-rc5-icklemasterv1%2B-10.00.Custom_amd64.deb https://dl.dropboxusercontent.com/u/55728161/linux-image-3.18.0-rc5-icklemasterv1%2B_3.18.0-rc5-icklemasterv1%2B-10.00.Custom_amd64.deb You can find a fork of this branch on github (to download with a faster connection): https://github.com/fritsch/linux/commits/ickle-master Would be nice, if you could try it. Feedback would be nice. I tried it, and while LightDM loaded, I couldn't log in since X crashed on login. 
Here is a stack trace:

(gdb) where
#0 0x00007f5857825d27 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007f5857827418 in __GI_abort () at abort.c:89
#2 0x00007f5859c77f6e in OsAbort () at ../../os/utils.c:1361
#3 0x00007f5859c7d7c3 in AbortServer () at ../../os/log.c:786
#4 0x00007f5859c7e60d in FatalError (f=f@entry=0x7f5859c922fc "%s: VT_ACTIVATE failed: %s\n") at ../../os/log.c:924
#5 0x00007f5859b78151 in switch_to (vt=7, from=0x7f5859c92375 "xf86OpenConsole") at ../../../../../hw/xfree86/os-support/linux/lnx_init.c:72
#6 0x00007f5859b783e9 in xf86OpenConsole () at ../../../../../hw/xfree86/os-support/linux/lnx_init.c:209
#7 0x00007f5859b54e9d in InitOutput (pScreenInfo=pScreenInfo@entry=0x7f5859f13b00 <screenInfo>, argc=argc@entry=11, argv=argv@entry=0x7fff0f438a78) at ../../../../hw/xfree86/common/xf86Init.c:597
#8 0x00007f5859b160ba in dix_main (argc=11, argv=0x7fff0f438a78, envp=<optimized out>) at ../../dix/main.c:202
#9 0x00007f5857810ec5 in __libc_start_main (main=0x7f5859b00680 <main>, argc=11, argv=0x7fff0f438a78, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fff0f438a68) at libc-start.c:287
#10 0x00007f5859b006ae in _start ()

(In reply to Peter Frühberger from comment #108)
> Here are new test kernels, chris wilson wants you to test:
>
> https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.18.0-rc5-icklemasterv1%2B_3.18.0-rc5-icklemasterv1%2B-10.00.Custom_amd64.deb
> https://dl.dropboxusercontent.com/u/55728161/linux-image-3.18.0-rc5-icklemasterv1%2B_3.18.0-rc5-icklemasterv1%2B-10.00.Custom_amd64.deb
>
> You can find a fork of this branch on github (to download with a faster
> connection): https://github.com/fritsch/linux/commits/ickle-master
>
> Would be nice, if you could try it.
>
> Feedback would be nice.

(In reply to Hugh Greenberg from comment #109)
> I tried it, and while LightDM loaded, I couldn't log in since X crashed on
> login.
> Here is a stack trace:

That's a bug in the VT layer; a race in the graphics mode takeover of the console. You have to restart X - I forget that lightdm doesn't handle that automatically.

LightDM and the Unity desktop worked after I disabled dri3 in the intel driver as Chris suggested. After that change and enabling tear free (also as Chris suggested), I was able to test this kernel. I was able to play a video in Kodi for over an hour using the VAAPI support as Peter described above and I did not experience a hang.

(In reply to Chris Wilson from comment #110)
> (In reply to Hugh Greenberg from comment #109)
> > I tried it, and while LightDM loaded, I couldn't log in since X crashed on
> > login. Here is a stack trace:
>
> That's a bug in the VT layer; a race in the graphics mode takeover of the
> console. You have to restart X - I forget that lightdm doesn't handle that
> automatically.

Tearfree is a nightmare for applications that count swapBuffers ...

But nevertheless that sounds promising, now we need to find out which change is the real fix.

Can you post the xorg.conf snippet you use to make it work?

This is the config file that I put in /usr/share/X11/xorg.conf.d:

Section "Device"
   Identifier "Intel Graphics"
   Driver "intel"
   Option "TearFree" "true"
EndSection

Here is my x11 intel driver recompiled with the --disable-dri3 option: https://drive.google.com/file/d/0B6zPD2kAJoTJcHdNS1J1VWpKY2s/view?usp=sharing . I'm using the oibaf ppa for the latest graphics stack - https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers.

(In reply to Peter Frühberger from comment #112)
> Tearfree is a nightmare for applications that count swapBuffers ...
>
> But nevertheless that sounds promising, now we need to find out which
> change is the real fix.
>
> Can you post the xorg.conf snippet you use to make it work?

Thanks much. I will keep TearFree off; from "man intel" one sees why it is not wanted in xbmc context.
If I didn't turn it off, I got hangs just by launching xbmc. You'll probably see the same thing. (In reply to Peter Frühberger from comment #114) > Thanks much. I will keep TearFree of, from "man intel" one sees why it is > not wanted in xbmc context. I mean that if I didn't turn tear free on, I got the hangs. (In reply to Hugh Greenberg from comment #115) > If I didn't turn it off, I got hangs just by launching xbmc. You'll probably > see the same thing. > > (In reply to Peter Frühberger from comment #114) > > Thanks much. I will keep TearFree of, from "man intel" one sees why it is > > not wanted in xbmc context. Ah nice! I have TearFree turned off, running with the "normal" intel drivers now. So I should also hit the bug now? You got much farther than me without modifying anything. I couldn't even launch xbmc without dri3 disabled and I got the hangs with TearFree turned off. (In reply to Peter Frühberger from comment #117) > Ah nice! > > I have TearFree turned off, running with the "normal" intel drivers now. So > I should also hit the bug now? I am running the following intel packages: ii xserver-xorg-video-intel 2:2.99.910-0ubuntu1.1 amd64 X.Org X server -- Intel i8xx, i9xx display driver try those - standard packages. We have massive issues with everything > 910, as it seems it has issues with swap buffers, therefore we use 910 on all the machines. Just purge the oibaf ppa. The reason why I didn't do that is because the vaapi driver that you were testing required a newer libva. I didn't know if that had dependencies on the newer mesa or not. I guess not. (In reply to Peter Frühberger from comment #119) > I am running the following intel packages: > ii xserver-xorg-video-intel 2:2.99.910-0ubuntu1.1 > amd64 X.Org X server -- Intel i8xx, i9xx display driver > > try those - standard packages. We have massive issues with everything > 910, > as it seems it has issues with swap buffers, therefore we use 910 on all the > machines. > > Just purge the oibaf ppa. 
We have a ppa for vaapi: https://launchpad.net/~wsnipex/+archive/ubuntu/vaapi nothing else is needed. I run 10.3 mesa from utopic though. Thanks! I'm on utopic with mesa 10.3 now and your vaapi ppa. Things are good so far. I'll keep xbmc going for a while again. (In reply to Peter Frühberger from comment #121) > We have a ppa for vaapi: https://launchpad.net/~wsnipex/+archive/ubuntu/vaapi > > nothing else is needed. I run 10.3 mesa from utopic though. I'm not sure when to celebrate, but I still haven't experienced any hangs. (In reply to Hugh Greenberg from comment #122) > Thanks! I'm on utopic with mesa 10.3 now and your vaapi ppa. Things are good > so far. I'll keep xbmc going for a while again. > > (In reply to Peter Frühberger from comment #121) > > We have a ppa for vaapi: https://launchpad.net/~wsnipex/+archive/ubuntu/vaapi > > > > nothing else is needed. I run 10.3 mesa from utopic though. I could be wrong again, but I'm guessing that this is the patch: https://github.com/fritsch/linux/commit/dba076df4b79d2472ef5d6e19b72ca3856eafb1a . I'll try just that patch and report back here later. (In reply to Hugh Greenberg from comment #123) > I'm not sure when to celebrate, but I still haven't experienced any hangs. > > (In reply to Hugh Greenberg from comment #122) > > Thanks! I'm on utopic with mesa 10.3 now and your vaapi ppa. Things are good > > so far. I'll keep xbmc going for a while again. > > > > (In reply to Peter Frühberger from comment #121) > > > We have a ppa for vaapi: https://launchpad.net/~wsnipex/+archive/ubuntu/vaapi > > > > > > nothing else is needed. I run 10.3 mesa from utopic though. You know what :-) I exactly thought the same. But I don't understand the code too much, so did not try. Can you "fix" 3.17.3 with that patch picked on top? Yes, I will do that and post the links here. (In reply to Peter Frühberger from comment #125) > You know what :-) > > I exactly thought the same. But I don't understand the code too much, so did > not try. 
> Can you "fix" 3.17.3 with that patch picked on top?

Patch does not apply cleanly, the batch_buffer does not seem to be there. Let's wait for what Chris Wilson will tell us? I picked and fixed what I think could be right in: https://github.com/fritsch/linux/tree/gpuhang

Save your time. Chris Wilson said on IRC that this fix will 100% not fix our bug.

I backported Chris Wilson's branch to 3.17.4.

Patch:
https://drive.google.com/file/d/0B6zPD2kAJoTJNEZnczJ3YU1ickU/view?usp=sharing

Kernel debs:
https://drive.google.com/file/d/0B6zPD2kAJoTJejNLdEFCS01lblk/view?usp=sharing
https://drive.google.com/file/d/0B6zPD2kAJoTJMXJlY3NYSVZfd2M/view?usp=sharing

I've tested these for 20+ hours and it has been working well. The only thing is that TearFree needs to be enabled in the intel driver until there is a patch available for that. You can enable it like this: https://wiki.archlinux.org/index.php/Intel_graphics#Tear-free_video .

*** Bug 86670 has been marked as a duplicate of this bug. ***

The huge patch that Hugh Greenberg ported to 3.17.4 is working very well for me. After a day of work with the system I didn't experience any freezes or hangs, with Chromium or at all. I only kept the kernel parameter i915.modeset=1.

I did experience a rare slowdown of Chromium with a segfault in journald; I hadn't seen such a segfault before.

kernel: WebCore: Worker[15893]: segfault at fbadbeef ip 00007f75abde2e25 sp 00007f758d175190 error 6 in chromium[7f75a993d000+5b6f000

What changed compared to the upstream kernel is that with upstream:
* when no kernel parameter was used (except i915.modeset=1), Chromium would hang and force me to kill it.
* when using the parameters i915.modeset=1 i915.semaphores=0 i915.use_mmio_flip=1 i915.enable_ppgtt=1 drm.vblankoffdelay=1, instead of hanging, Chromium would slow the system down to almost a halt. Kodi would also trigger such slowdowns (at a much higher rate than with the patch). It seems like i915.semaphores=0 is the one making the difference between a hang and a slowdown.

I didn't pay much attention to testing VAAPI, but it does seem to work fine.

Please test this patch

http://patchwork.freedesktop.org/patch/37647/

(In reply to Daniel Vetter from comment #133)
> Please test this patch
>
> http://patchwork.freedesktop.org/patch/37647/

I've already tested that theory with Simon's testcase. It's another dead end.

Setting this to Assigned again as the main dev already tested the bits requested by danvet as non-working.

Finally I encountered a hang with Chris Wilson's branch and kernel 3.17.4.

kernel: [drm] GPU HANG: ecode 7:0:0x87d3bffa, in chromium [15797], reason: Stuck on render ring, action: reset

My system froze after I reopened Chromium so I don't have the GPU crash dump.

I think that it is possible that this is a different problem due to Chromium and hardware acceleration. I have been running on two devices for 7 days straight without a single hang.

(In reply to dhead666 from comment #136)
> Finally I encounter a hang with Chris Wilson's branch and kernel 3.17.4.
>
> kernel: [drm] GPU HANG: ecode 7:0:0x87d3bffa, in chromium [15797], reason:
> Stuck on render ring, action: reset
>
> My system froze after I reopened Chromium so I don't have the gpu crash dump.

*** Bug 78983 has been marked as a duplicate of this bug. ***
*** Bug 87045 has been marked as a duplicate of this bug. ***
*** Bug 87176 has been marked as a duplicate of this bug. ***

Created attachment 110698 [details] [review]
Add extra flush flags for gen7 invalidate

A pair of patches that seem to do the trick...
Created attachment 110699 [details] [review]
Keep GPU awake for context switches

I built Ubuntu kernels with the two patches applied. After discussion with Chris I left out the ringbuffer changes as those were not needed. You can find the patches in my 3.18.0 tree on github.com/fritsch - the latest two of them.

Ubuntu kernel debs are here:
https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.18.0-ickle75%2B_3.18.0-ickle75%2B-10.00.Custom_amd64.deb
https://dl.dropboxusercontent.com/u/55728161/linux-image-3.18.0-ickle75%2B_3.18.0-ickle75%2B-10.00.Custom_amd64.deb

Happy testing.

Looking very good for now. We have ported the fix to kernel 3.17 and included it in OpenELEC. For now we already have one promising report. I will keep you informed. Thank you very much.

I can confirm that the bug is fixed in Peter's kernel. No GPU hangs anymore :)

So far no hangs on my C720 Chromebook with the two patches from Peter's repo. Thanks Chris for figuring it out, this bug was very annoying and disruptive.

The fix is working great for me. Thank you very much Chris.

Working really well here too! :-) I used only the 2nd 3.17 kernel patch from Peter Frühberger (the one without the ringbuffer changes). For reference I am running Arch Linux 3.17 x86_64 with a Haswell Pentium G3240 CPU. I am testing using XBMC 13.2 Gotham and so far have not had any GPU hangs, frame drops or skips during HD video playback. :-) Thanks very much to Chris Wilson for all your hard work on fixing this, and to Peter Frühberger. When do you think it is likely that this patch will be added to the latest kernel at kernel.org? Sorry for asking but I am unfamiliar with how the Linux kernel patch submission process works.

In addition to the above, for reference, I am running the following Arch Linux x86_64 packages:

libva 1.4.1-1
libva-intel-driver 1.4.1-1
xf86-video-intel 2.99.916-3
mesa 10.3.5-1
mesa-dri 10.3.5-1
libxvmc 1.0.8-1

Hope this information helps.

Tested and working very well.
We are using a ported version in OpenELEC in the VAAPI testing thread.

I'm not sure if this is another issue or related to this one but even with the two patches from Peter's repo I'm still experiencing slowdowns and excessive use of RAM with Chromium.
Running Chromium with few tabs opened and another application that uses the GPU (like Kodi) will quicken the appearance of slowdown.
One might point the slowdowns source as Chromium's excessive use of RAM but I've got 4GB of it.

I'm experiencing this for a while but until now the "stuck on render ring" forced me to use i915 kernel parameters or the huge backport from Chris Wilson's development branch so I couldn't be sure this issue will be still exist after resolving the "stuck on render ring".

This is usually the output in journald:

systemd-coredump[19029]: Process 14662 (chromium) of user 1000 dumped core.
chromium.desktop[14628]: [19045:19046:1216/181255:ERROR:gpu_watchdog_thread.cc(253)] The GPU process hung. Terminating after 10000 ms.
kernel: Watchdog[19046]: segfault at 0 ip 00007f7cfe3e619b sp 00007f7ce78f75a0 error 6 in chromium[7f7cfa128000+6499000]
chromium.desktop[14628]: [14628:14628:1216/181255:ERROR:gpu_process_transport_factory.cc(437)] Failed to establish GPU channel.
systemd-coredump[19047]: Process 19045 (chromium) of user 1000 dumped core.
chromium.desktop[14628]: [14628:14628:1216/181255:ERROR:gpu_process_transport_factory.cc(461)] Lost UI shared context.
gnome-session[8166]: Window manager warning: last_focus_time (252848839) is greater than comparison timestamp (252820277). This most likely represents a buggy client sending inaccurate timestamps in messages such as _NET_ACTIVE_WINDOW. Trying to work around...
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 1 PID: 18930 at drivers/gpu/drm/i915/intel_pm.c:6585 intel_display_power_put+0x15c/0x170 [i915]()
kernel: Modules linked in: fuse ctr ccm ecb ath3k btusb bluetooth hid_logitech_dj usbhid hid nvram tpm_infineon snd_hda_codec_hdmi arc4 joydev ath9k mousedev cyapa ath9k_common ath9k_hw coretemp hwmon iTCO_wdt iTCO_vendor_support intel_rapl ath x86_pkg_temp_thermal intel_powerclamp mac80211 kvm_intel kvm crct10dif_pclmul crc32_pclmul evdev crc32c_intel mac_hid snd_hda_codec_realtek chromeos_laptop snd_hda_codec_generic cfg80211 ghash_clmulni_intel cryptd pcspkr serio_raw i915 rfkill i2c_i801 snd_hda_intel shpchp lpc_ich snd_hda_controller fan ac tpm_tis battery snd_hda_codec tpm snd_hwdep drm_kms_helper i2c_designware_pci snd_pcm thermal dw_dmac_pci drm video snd_timer dw_dmac dw_dmac_core gpio_lynxpoint 8250_dw snd soundcore intel_gtt i2c_designware_platform i2c_algo_bit processor i2c_designware_core
kernel: spi_pxa2xx_platform button uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core v4l2_common videodev media i2c_core sch_fq_codel ext4 crc16 mbcache jbd2 sd_mod atkbd libps2 i8042 serio sdhci_acpi sdhci led_class mmc_core ahci libahci libata scsi_mod xhci_pci xhci_hcd usbcore usb_common
kernel: CPU: 1 PID: 18930 Comm: kworker/1:0 Tainted: G W 3.18.0-1-mainline #3
kernel: Hardware name: Acer Peppy, BIOS 10/18/2013
kernel: Workqueue: events edp_panel_vdd_work [i915]
kernel: 0000000000000000 0000000012e4cb13 ffff88003790fd28 ffffffff8154ecb4
kernel: 0000000000000000 0000000000000000 ffff88003790fd68 ffffffff81072bc1
kernel: ffff88003790fd48 ffff88007b22002c 000000000000000b ffff88007b228810
kernel: Call Trace:
kernel: [<ffffffff8154ecb4>] dump_stack+0x4e/0x71
kernel: [<ffffffff81072bc1>] warn_slowpath_common+0x81/0xa0
kernel: [<ffffffff81072cda>] warn_slowpath_null+0x1a/0x20
kernel: [<ffffffffa03d3a3c>] intel_display_power_put+0x15c/0x170 [i915]
kernel: [<ffffffffa044446d>] pps_unlock+0x3d/0x50 [i915]
kernel: [<ffffffffa04480c9>] edp_panel_vdd_work+0x39/0x40 [i915]
kernel: [<ffffffff8108b7c5>] process_one_work+0x145/0x400
kernel: [<ffffffff8108bd8b>] worker_thread+0x6b/0x4a0
kernel: [<ffffffff8108bd20>] ? init_pwq.part.22+0x10/0x10
kernel: [<ffffffff81090dfa>] kthread+0xea/0x100
kernel: [<ffffffff81090d10>] ? kthread_create_on_node+0x1c0/0x1c0
kernel: [<ffffffff8155477c>] ret_from_fork+0x7c/0xb0
kernel: [<ffffffff81090d10>] ? kthread_create_on_node+0x1c0/0x1c0
kernel: ---[ end trace a3c190b67c9fbfe4 ]---

Sometimes I also get this one:

kernel: [drm:ivybridge_set_fifo_underrun_reporting] *ERROR* uncleared fifo underrun on pipe A

I don't know if this is the same issue or not, but I have noticed slow downs on the acer c720 related to swap and the disk cache. My solution has been to set the following in /etc/sysctl.conf:

vm.swappiness = 0
vm.dirty_background_bytes = 0
vm.dirty_bytes = 0
vm.dirty_ratio = 20
vm.dirty_background_ratio = 10
vm.dirty_writeback_centisecs = 500

I know you are on arch, so you may not have to do this, but I needed to change /usr/lib/pm-utils/power.d/laptop-mode such that I replaced the vmfiles variable with vmfiles="laptop_mode", otherwise the change would not be permanent. So far this has worked for me to reduce slow downs.

(In reply to dhead666 from comment #151)
> I'm not sure if this is another issue or related to this one but even with
> the two patches from Peter's repo I'm still experiencing slowdowns and
> excessive use of RAM with Chromium.
> Running Chromium with few tabs opened and another application that uses the
> GPU (like Kodi) will quicken the appearance of slowdown.
> One might point the slowdowns source as Chromium's excessive use of RAM but
> I've got 4GB of it.
> > I'm experiencing this for a while but until now the "stuck on render ring" > forced me to use i915 kernel parameters or the huge backport from Chris > Wilson's development branch so I couldn't be sure this issue will be still > exist after resolving the "stuck on render ring". > > This is usually the output in journald: > > systemd-coredump[19029]: Process 14662 (chromium) of user 1000 dumped core. > chromium.desktop[14628]: > [19045:19046:1216/181255:ERROR:gpu_watchdog_thread.cc(253)] The GPU process > hung. Terminating after 10000 ms. > kernel: Watchdog[19046]: segfault at 0 ip 00007f7cfe3e619b sp > 00007f7ce78f75a0 error 6 in chromium[7f7cfa128000+6499000] > chromium.desktop[14628]: > [14628:14628:1216/181255:ERROR:gpu_process_transport_factory.cc(437)] Failed > to establish GPU channel. > systemd-coredump[19047]: Process 19045 (chromium) of user 1000 dumped core. > chromium.desktop[14628]: > [14628:14628:1216/181255:ERROR:gpu_process_transport_factory.cc(461)] Lost > UI shared context. > gnome-session[8166]: Window manager warning: last_focus_time (252848839) is > greater than comparison timestamp (252820277). This most likely represents > a buggy client sending inaccurate timestamps in messages such as > _NET_ACTIVE_WINDOW. Trying to work around... 
> kernel: ------------[ cut here ]------------ > kernel: WARNING: CPU: 1 PID: 18930 at drivers/gpu/drm/i915/intel_pm.c:6585 > intel_display_power_put+0x15c/0x170 [i915]() > kernel: Modules linked in: fuse ctr ccm ecb ath3k btusb bluetooth > hid_logitech_dj usbhid hid nvram tpm_infineon snd_hda_codec_hdmi arc4 joydev > ath9k mousedev cyapa ath9k_common ath9k_hw coretemp hwmon iTCO_wdt > iTCO_vendor_support intel_rapl ath x86_pkg_temp_thermal intel_powerclamp > mac80211 kvm_intel kvm crct10dif_pclmul crc32_pclmul evdev crc32c_intel > mac_hid snd_hda_codec_realtek chromeos_laptop snd_hda_codec_generic cfg80211 > ghash_clmulni_intel cryptd pcspkr serio_raw i915 rfkill i2c_i801 > snd_hda_intel shpchp lpc_ich snd_hda_controller fan ac tpm_tis battery > snd_hda_codec tpm snd_hwdep drm_kms_helper i2c_designware_pci snd_pcm > thermal dw_dmac_pci drm video snd_timer dw_dmac dw_dmac_core gpio_lynxpoint > 8250_dw snd soundcore intel_gtt i2c_designware_platform i2c_algo_bit > processor i2c_designware_core > kernel: spi_pxa2xx_platform button uvcvideo videobuf2_vmalloc > videobuf2_memops videobuf2_core v4l2_common videodev media i2c_core > sch_fq_codel ext4 crc16 mbcache jbd2 sd_mod atkbd libps2 i8042 serio > sdhci_acpi sdhci led_class mmc_core ahci libahci libata scsi_mod xhci_pci > xhci_hcd usbcore usb_common > kernel: CPU: 1 PID: 18930 Comm: kworker/1:0 Tainted: G W > 3.18.0-1-mainline #3 > kernel: Hardware name: Acer Peppy, BIOS 10/18/2013 > kernel: Workqueue: events edp_panel_vdd_work [i915] > kernel: 0000000000000000 0000000012e4cb13 ffff88003790fd28 ffffffff8154ecb4 > kernel: 0000000000000000 0000000000000000 ffff88003790fd68 ffffffff81072bc1 > kernel: ffff88003790fd48 ffff88007b22002c 000000000000000b ffff88007b228810 > kernel: Call Trace: > kernel: [<ffffffff8154ecb4>] dump_stack+0x4e/0x71 > kernel: [<ffffffff81072bc1>] warn_slowpath_common+0x81/0xa0 > kernel: [<ffffffff81072cda>] warn_slowpath_null+0x1a/0x20 > kernel: [<ffffffffa03d3a3c>] 
intel_display_power_put+0x15c/0x170 [i915] > kernel: [<ffffffffa044446d>] pps_unlock+0x3d/0x50 [i915] > kernel: [<ffffffffa04480c9>] edp_panel_vdd_work+0x39/0x40 [i915] > kernel: [<ffffffff8108b7c5>] process_one_work+0x145/0x400 > kernel: [<ffffffff8108bd8b>] worker_thread+0x6b/0x4a0 > kernel: [<ffffffff8108bd20>] ? init_pwq.part.22+0x10/0x10 > kernel: [<ffffffff81090dfa>] kthread+0xea/0x100 > kernel: [<ffffffff81090d10>] ? kthread_create_on_node+0x1c0/0x1c0 > kernel: [<ffffffff8155477c>] ret_from_fork+0x7c/0xb0 > kernel: [<ffffffff81090d10>] ? kthread_create_on_node+0x1c0/0x1c0 > kernel: ---[ end trace a3c190b67c9fbfe4 ]--- > > > Sometimes I also get this one: > > kernel: [drm:ivybridge_set_fifo_underrun_reporting] *ERROR* uncleared fifo > underrun on pipe A commit add284a3a2481e759d6bec35f6444c32c8ddc383 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Dec 16 08:44:32 2014 +0000 drm/i915: Force the CS stall for invalidate flushes and commit 2c550183476dfa25641309ae9a28d30feed14379 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Dec 16 10:02:27 2014 +0000 drm/i915: Disable PSMI sleep messages on all rings around context switches in drm-intel-next-fixes. @Hugh Greenberg, thanks but I'm not using swap. I think I'll try to gather more info, properly compare against other GPUs and might try some of the tests in intel-gpu-tools before opening a separate issue on the matter. Anyway, this should be discussed somewhere else so if you've anything else to share you're welcome to do this by email, G+ or even ping me at irc (dhead666@freenode). *** Bug 87571 has been marked as a duplicate of this bug. *** *** Bug 88017 has been marked as a duplicate of this bug. *** *** Bug 88044 has been marked as a duplicate of this bug. *** *** Bug 88341 has been marked as a duplicate of this bug. *** *** Bug 88612 has been marked as a duplicate of this bug. *** *** Bug 88604 has been marked as a duplicate of this bug. 
***
*** Bug 88839 has been marked as a duplicate of this bug. ***
*** Bug 89010 has been marked as a duplicate of this bug. ***
*** Bug 89025 has been marked as a duplicate of this bug. ***
*** Bug 89065 has been marked as a duplicate of this bug. ***
*** Bug 89089 has been marked as a duplicate of this bug. ***
*** Bug 89183 has been marked as a duplicate of this bug. ***
*** Bug 89531 has been marked as a duplicate of this bug. ***
*** Bug 89799 has been marked as a duplicate of this bug. ***
*** Bug 89964 has been marked as a duplicate of this bug. ***
*** Bug 90165 has been marked as a duplicate of this bug. ***
*** Bug 90509 has been marked as a duplicate of this bug. ***
*** Bug 90635 has been marked as a duplicate of this bug. ***
*** Bug 90659 has been marked as a duplicate of this bug. ***

Hi,

here is what did the trick with my Haswell Celeron 2955U: in /etc/default/grub I added kernel parameters I found in this thread to the line GRUB_CMDLINE_LINUX_DEFAULT="", so that the whole line looks like this:

GRUB_CMDLINE_LINUX_DEFAULT="drm.debug=0 drm.vblankoffdelay=1 i915.semaphores=0"

For 10 days now there have been no GPU hangs on my Ubuntu system with the Celeron 2955U.

Thanks to all.

*** Bug 90729 has been marked as a duplicate of this bug. ***
*** Bug 91024 has been marked as a duplicate of this bug. ***
*** Bug 91144 has been marked as a duplicate of this bug. ***

I tried the following as mentioned by Winni:

GRUB_CMDLINE_LINUX_DEFAULT="drm.debug=0 drm.vblankoffdelay=1 i915.semaphores=0"

The screen timed out as usual (still able to move the mouse cursor around, resize windows, see the cursor change, etc.), but this time it never came back. I even tried "ps aux | grep compiz" and "kill -9 ####" on compiz and compiz-decorator, which usually causes it to reload, but I ended up having to restart the computer as I didn't know what else to try.

I'm reopening the bug for this reason.
If you feel I shouldn't, then perhaps I should reopen my original here: https://bugs.freedesktop.org/show_bug.cgi?id=90659 - let me know. If there's anything I can do to assist in finding a solution, don't hesitate to ask! Thanks, Dave (In reply to Dave from comment #178) > I tried the following as mentioned by Winni: > > GRUB_CMDLINE_LINUX_DEFAULT="drm.debug=0 drm.vblankoffdelay=1 > i915.semaphores=0" This bug has nothing to do with semaphores. Just update your kernel. *** Bug 91932 has been marked as a duplicate of this bug. *** *** Bug 91955 has been marked as a duplicate of this bug. *** *** Bug 92647 has been marked as a duplicate of this bug. *** *** Bug 92763 has been marked as a duplicate of this bug. *** *** Bug 93756 has been marked as a duplicate of this bug. *** *** Bug 95084 has been marked as a duplicate of this bug. *** |
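Several comments in this thread pass around boot parameters as workarounds, and the final reply says the bug has nothing to do with semaphores and that the kernel should simply be updated. For readers who want to see which of those parameters are still set on a machine before dropping them, here is a minimal Python sketch. The parameter names are the ones quoted in this thread; `WORKAROUND_PARAMS`, `parse_cmdline` and `active_workarounds` are hypothetical helper names, and the parsing is deliberately simple (it assumes no quoted values containing spaces).

```python
# Boot parameters mentioned as workarounds in this thread (none of them is
# the actual fix, which landed as kernel patches).
WORKAROUND_PARAMS = (
    "i915.semaphores",
    "i915.use_mmio_flip",
    "i915.enable_ppgtt",
    "drm.vblankoffdelay",
    "drm.debug",
)


def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Simplification: tokens are split on whitespace, so quoted values that
    contain spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def active_workarounds(cmdline):
    """Return the workaround parameters present on the given command line."""
    params = parse_cmdline(cmdline)
    return {name: params[name] for name in WORKAROUND_PARAMS if name in params}
```

On a Linux system the string to check can be read from /proc/cmdline, for example `active_workarounds(open("/proc/cmdline").read())`.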