Created attachment 116131 [details]
error state

==System Environment==
--------------------------
Regression: yes
good commit: 65de797816eadb227c45b0127d7ff92410fa3814(dinq)
bad commit: 99c044d7d5cc65661436f271754c011d0f1a02de(dinq)

Non-working platforms: BDW/BSW

==kernel==
--------------------------
drm-intel-nightly/b44f6771cba2cc90525d037445330ed766377aa9
commit b44f6771cba2cc90525d037445330ed766377aa9
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Thu May 28 13:39:29 2015 +0200

    drm-intel-nightly: 2015y-05m-28d-11h-38m-51s UTC integration manifest

==Bug detailed description==
-----------------------------
Run ./gem_reloc_vs_gpu --run-subtest forked-faulting-reloc-thrashing-hang; the GPU reset fails.

The following cases also have this issue:
igt@gem_reloc_vs_gpu@forked-interruptible-thrashing-hang
igt@gem_reloc_vs_gpu@forked-thrashing-hang

dmesg:
[ 91.753899] [drm] stuck on blitter ring
[ 91.754663] [drm] GPU HANG: ecode 8:2:0xe77ffff2, in gem_reloc_vs_gp [4986], reason: Ring hung, action: reset
[ 91.754665] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 91.754666] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 91.754668] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 91.754669] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 91.754670] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 91.754705] [drm:i915_reset_and_wakeup] resetting chip
[ 101.748383] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[ 101.748413] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[ 101.748442] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[ 101.748477] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[ 101.748500] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[ 101.748525] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[ 101.748547] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[ 101.748570] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[ 101.748617] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[ 101.750656] Setting dangerous option prefault_disable - tainting kernel
[ 101.751194] Setting dangerous option prefault_disable - tainting kernel
[ 101.751291] Setting dangerous option prefault_disable - tainting kernel
[ 240.060726] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 240.060767] kworker/u16:3 D ffff8800a7c77aa8 0 1237 2 0x00000000
[ 240.060797] Workqueue: i915-hangcheck i915_hangcheck_elapsed [i915]
[ 240.060799] ffff8800a7c77aa8 ffff880002ae0000 ffff8800a7f82120 ffff8800a7c77ad8
[ 240.060802] 0000000000000246 0000000000000000 ffff8800a7c78000 0000000000000246
[ 240.060805] 0000000000000000 ffff88000355c068 ffff8800a7f82120 ffff8800a7c77ac8
[ 240.060808] Call Trace:
[ 240.060814] [<ffffffff81896db4>] schedule+0x75/0x84
[ 240.060816] [<ffffffff81897011>] schedule_preempt_disabled+0xe/0x10
[ 240.060818] [<ffffffff818986c5>] mutex_lock_nested+0x17c/0x2cb
[ 240.060833] [<ffffffffa0094a13>] ? i915_reset+0x3a/0x13e [i915]
[ 240.060847] [<ffffffffa0094a13>] i915_reset+0x3a/0x13e [i915]
[ 240.060866] [<ffffffffa00c80e2>] i915_reset_and_wakeup+0xd3/0x133 [i915]
[ 240.060885] [<ffffffffa00cbd51>] i915_handle_error+0x5ab/0x5bd [i915]
[ 240.060905] [<ffffffffa00dda30>] ? gen6_read32+0x11a/0x18b [i915]
[ 240.060910] [<ffffffff8109352f>] ? vprintk_default+0x1d/0x1f
[ 240.060913] [<ffffffff8188f3e9>] ? printk+0x46/0x48
[ 240.060930] [<ffffffffa00cc14f>] i915_hangcheck_elapsed+0x3a3/0x3c3 [i915]
[ 240.060933] [<ffffffff8105ab88>] ? process_one_work+0x1ba/0x409
[ 240.060935] [<ffffffff8105abf3>] process_one_work+0x225/0x409
[ 240.060937] [<ffffffff8105ab74>] ? process_one_work+0x1a6/0x409
[ 240.060940] [<ffffffff8105b694>] worker_thread+0x275/0x369
[ 240.060942] [<ffffffff8107c63a>] ? complete+0x42/0x4a
[ 240.060944] [<ffffffff8105b41f>] ? cancel_delayed_work_sync+0x15/0x15
[ 240.060947] [<ffffffff81060039>] kthread+0xf6/0xfe
[ 240.060950] [<ffffffff8105ff43>] ? kthread_create_on_node+0x1ac/0x1ac
[ 240.060953] [<ffffffff8189b892>] ret_from_fork+0x42/0x70
[ 240.060955] [<ffffffff8105ff43>] ? kthread_create_on_node+0x1ac/0x1ac
[ 240.060957] INFO: lockdep is turned off.
[ 240.060966] INFO: task gem_reloc_vs_gp:4986 blocked for more than 120 seconds.

==Reproduce steps==
----------------------------
1. ./gem_reloc_vs_gpu --run-subtest forked-faulting-reloc-thrashing-hang
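
For anyone re-running this, here is a minimal sketch of a wrapper that runs the three affected subtests one after another under a time limit, so that a run which never returns is reported rather than left blocking the shell. The function name run_hang_subtests, the 600-second default and the pass/fail output format are assumptions for illustration; only the binary name, the --run-subtest option and the subtest names come from this report.

```shell
# Sketch only: run the three affected subtests, each under a time limit.
# run_hang_subtests is a hypothetical helper, not part of intel-gpu-tools.
#   $1: path to the test binary (default: ./gem_reloc_vs_gpu)
#   $2: per-subtest limit in seconds (default: 600, an assumed value)
run_hang_subtests() {
    igt_bin="${1:-./gem_reloc_vs_gpu}"
    limit="${2:-600}"
    for t in forked-faulting-reloc-thrashing-hang \
             forked-interruptible-thrashing-hang \
             forked-thrashing-hang; do
        # timeout(1) exits with 124 when the limit is reached
        if timeout "$limit" "$igt_bin" --run-subtest "$t" >/dev/null 2>&1; then
            echo "$t: pass"
        else
            echo "$t: fail (exit $?)"
        fi
    done
}

# Usage: run_hang_subtests ./gem_reloc_vs_gpu 600
```

The limit is best-effort: a task stuck in uninterruptible sleep, like the ones in the trace above, may not exit when timeout signals it.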
Created attachment 116132 [details] dmesg
Created attachment 116133 [details] output
Please bisect.
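
A rough sketch of how each bisect step could be judged, assuming every candidate kernel is built, booted and the subtest run by hand. The helper name bisect_verdict and the /tmp/dmesg.txt path are hypothetical; the commit IDs are the good and bad commits from the report, and the matched string is the reset timeout message from the dmesg above. It keys only on that one symptom, so it is no better than the symptom itself.

```shell
# Sketch only: turn a captured kernel log into a verdict for "git bisect".
# bisect_verdict is a hypothetical helper name.
#   $1: file containing the dmesg output captured after running the subtest
bisect_verdict() {
    if grep -q 'Timed out waiting for the gpu reset to complete' "$1"; then
        echo bad
    else
        echo good
    fi
}

# Usage, from the kernel tree (bad commit first, then good commit):
#   git bisect start 99c044d7d5cc65661436f271754c011d0f1a02de \
#                    65de797816eadb227c45b0127d7ff92410fa3814
#   ... build and boot the candidate, run the subtest, then:
#   dmesg > /tmp/dmesg.txt
#   git bisect "$(bisect_verdict /tmp/dmesg.txt)"
```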
Bisect shows:

The first bad commit could be any of:
b47161858ba13c9c7e03333132230d66e008dd55
03ade51185596a1d1028531c78fda557f244d676
We cannot bisect more!

commit 03ade51185596a1d1028531c78fda557f244d676
Author: Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Mon Apr 27 13:41:18 2015 +0100
Commit: Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Thu May 21 15:11:43 2015 +0200

    drm/i915: Inline check required for object syncing prior to execbuf

    This trims a little overhead from the common case of not needing to
    synchronize between rings.

    v2: execlists is special and likes to duplicate code.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

commit b47161858ba13c9c7e03333132230d66e008dd55
Author: Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Mon Apr 27 13:41:17 2015 +0100
Commit: Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Thu May 21 15:11:42 2015 +0200

    drm/i915: Implement inter-engine read-read optimisations

    Currently, we only track the last request globally across all engines.
    This prevents us from issuing concurrent read requests on e.g. the RCS
    and BCS engines (or more likely the render and media engines). Without
    semaphores, we incur costly stalls as we synchronise between rings -
    greatly impacting the current performance of Broadwell versus Haswell in
    certain workloads (like video decode). With the introduction of
    reference counted requests, it is much easier to track the last request
    per ring, as well as the last global write request so that we can
    optimise inter-engine read read requests (as well as better optimise
    certain CPU waits).
Bug scrub: Elio, could you check whether this still reproduces?
This bug is still present with the latest configuration on BDW-U.

Environment:
xserver checkout xorg-server-1.17.2
drm checkout libdrm-2.4.65
xf86-video-intel checkout 2.99.917
mesa checkout mesa-11.0.4
libva checkout libva-1.6.1
intel-driver checkout 1.6.1
cairo checkout 1.14.2

Broadwell-U
Hardware
Platform: Lenovo G50
Processor: Intel Core I5-5200 2.20 GHz
Software
Linux distribution: Ubuntu 14.04.03 LTS 64 bits
BIOS: B0CN69WW
Kernel: http://vanaheimr.fr.intel.com/shared/out/kernels/drm-intel-testing/WW44_4.3.0-rc6_8707465/
The bisection is a red herring. The issue is a race in the checking of atomic_t reset_counter that for whatever reason appears to be provoked by execlists. Note that your dmesg does not include the culprit.
This issue is still present on the latest drm-intel-testing and drm-intel-nightly kernels with the following configuration:

Software configuration:
--------------------------------
Ubuntu 14.04.03 x86_64
Xserver : 1.17.4 (commit : 2c7fa2a)
libdrm : 2.4.65 (commit : c349616)
Xf86-video-intel : 2.99.917 (commit : baec802)
Mesa : 11.0.4 (commit : 31bf247)
Libva : 1.6.1 (commit : 613eb96)
Intel-driver : 1.6.1 (commit : 35858c6)
Cairo : 1.14.4 (commit : 0317ee7)
Intel-GPU-Tools : 1.12 (commit : 1f9e055)
BIOS : 5.6

Kernel : latest drm-intel-testing (4.3.0-rc6-testing)
commit 87074657f22e38163e712ca417e1a398d00096b6
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Fri Oct 23 11:56:52 2015 +0200

test : gem_reloc_vs_gpu / forked-faulting-reloc-thrashing-hang BDW-U = fail
test : gem_reloc_vs_gpu / forked-interruptible-thrashing-hang BDW-U = fail
test : gem_reloc_vs_gpu / forked-thrashing-hang BDW-U =

Kernel : latest drm-intel-nightly: 2015y-11m-06d-12h-48m-02s UTC integration manifest
commit a3b0dec82fdb59c629c4fb9847245b80b0cf69dd
Author: Jani Nikula <jani.nikula@intel.com>
Date: Fri Nov 6 14:48:23 2015 +0200

test : gem_reloc_vs_gpu / forked-faulting-reloc-thrashing-hang BDW-U = fail
test : gem_reloc_vs_gpu / forked-interruptible-thrashing-hang BDW-U = fail
test : gem_reloc_vs_gpu / forked-thrashing-hang BDW-U = fail

Note: The tests never finish (they run for more than 10 minutes). The dmesg and GPU_crash_dump are attached.
Created attachment 119545 [details] dmesg-bdw
Created attachment 119546 [details] GPU_crash_dump_bdw
All of the mentioned tests are being skipped, even though we are running them over 2 pipes. Sharing the configuration:

++ Kernel version : 4.4.4-040404-generic
++ Linux distribution : Ubuntu 15.10
++ Architecture : 64-bit
++ xf86-video-intel version : 2.99.917
++ Xorg-Xserver version : 1.17.2
++ DRM version : 2.4.64
++ VAAPI version : Intel i965 driver for Intel(R) Broadwell - 1.6.0
++ Cairo version : 1.14.2
++ Intel GPU Tools version : Tag [intel-gpu-tools-1.14-74-g431f6c4] / Commit [431f6c4]
++ Kernel driver in use : i915
++ Bios revision : 5.6

--- Hardware information ---
++ Platform :
++ Motherboard model :
++ Motherboard type : NUC5i7RYB Desktop
++ Motherboard manufacturer :
++ CPU family : Core i7
++ CPU information : Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
++ GPU Card : Intel Corporation Broadwell-U Integrated Graphics (rev 09) (prog-if 00 [VGA controller])
++ Memory ram : 8 GB
++ Maximum memory ram allowed : 16 GB
++ Display resolution :
++ CPU's number : 4
++ Hard drive capacity : 120 GB
Please disregard the last status; the tests are still failing with the mentioned configuration.
commit 821ed7df6e2a1dbae243caebcfe21a0a4329fca0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Sep 9 14:11:53 2016 +0100

    drm/i915: Update reset path to fix incomplete requests

    Update reset path in preparation for engine reset which requires
    identification of incomplete requests and associated context and fixing
    their state so that engine can resume correctly after reset.

    The request that caused the hang will be skipped and head is reset to the
    start of breadcrumb. This allows us to resume from where we left-off.
    Since this request didn't complete normally we also need to cleanup elsp
    queue manually. This is vital if we employ nonblocking request
    submission where we may have a web of dependencies upon the hung request
    and so advancing the seqno manually is no longer trivial.