Summary: | [BDW/BSW Bisected]igt/gem_reloc_vs_gpu/forked-faulting-reloc-thrashing-hang causes GPU reset fail | |
---|---|---|---
Product: | DRI | Reporter: | lu hua <huax.lu>
Component: | DRM/Intel | Assignee: | Chris Wilson <chris>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | major | |
Priority: | high | CC: | christophe.prigent, intel-gfx-bugs
Version: | unspecified | |
Hardware: | All | |
OS: | Linux (All) | |
Whiteboard: | | |
i915 platform: | BDW, BSW/CHT | i915 features: | GEM/Other

Description
lu hua
2015-05-29 02:19:26 UTC
Created attachment 116132 [details]
dmesg
Created attachment 116133 [details]
output
Please bisect.

Bisect shows: The first bad commit could be any of:
b47161858ba13c9c7e03333132230d66e008dd55
03ade51185596a1d1028531c78fda557f244d676
We cannot bisect more!

commit 03ade51185596a1d1028531c78fda557f244d676
Author: Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Mon Apr 27 13:41:18 2015 +0100
Commit: Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Thu May 21 15:11:43 2015 +0200

    drm/i915: Inline check required for object syncing prior to execbuf

    This trims a little overhead from the common case of not needing to synchronize between rings.

    v2: execlists is special and likes to duplicate code.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

commit b47161858ba13c9c7e03333132230d66e008dd55
Author: Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Mon Apr 27 13:41:17 2015 +0100
Commit: Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Thu May 21 15:11:42 2015 +0200

    drm/i915: Implement inter-engine read-read optimisations

    Currently, we only track the last request globally across all engines. This prevents us from issuing concurrent read requests on e.g. the RCS and BCS engines (or more likely the render and media engines). Without semaphores, we incur costly stalls as we synchronise between rings - greatly impacting the current performance of Broadwell versus Haswell in certain workloads (like video decode). With the introduction of reference counted requests, it is much easier to track the last request per ring, as well as the last global write request so that we can optimise inter-engine read read requests (as well as better optimise certain CPU waits).

Bug scrub: Elio could you check if still reproduced?
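For readers unfamiliar with the idea in the b4716185 commit message quoted above, the following is a minimal Python model of "track the last request per ring, as well as the last global write request". It is a sketch only, not the i915 implementation: the names `Request`, `TrackedObject`, `must_wait_for` and `mark_active` are hypothetical, and the engine names are just labels. It assumes that work on the same engine is already ordered by submission and that a writer has waited for all earlier users before it is recorded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Request:
    """One unit of work submitted to a single engine (hypothetical model)."""
    engine: str
    seqno: int


@dataclass
class TrackedObject:
    """Buffer object that remembers its last read per engine and its last write."""
    last_read: Dict[str, Request] = field(default_factory=dict)
    last_write: Optional[Request] = None

    def must_wait_for(self, engine: str, write: bool) -> List[Request]:
        """Return the requests on other engines that must complete first."""
        pending: List[Request] = []
        if write:
            # A writer conflicts with every outstanding reader.
            pending.extend(self.last_read.values())
        if self.last_write is not None:
            # Readers and writers both conflict with the last writer.
            pending.append(self.last_write)
        deps: List[Request] = []
        for req in pending:
            # Work on the same engine is already ordered by submission.
            if req.engine != engine and req not in deps:
                deps.append(req)
        return deps

    def mark_active(self, request: Request, write: bool) -> None:
        """Record that `request` uses this object."""
        if write:
            # Simplification (assumption): the writer waited for all earlier
            # users, so only the writer itself remains outstanding.
            self.last_read = {request.engine: request}
            self.last_write = request
        else:
            self.last_read[request.engine] = request
```

In this model two reads on different engines depend only on the last writer, never on each other, which is the read-read case the commit message describes. A later write depends on every outstanding reader on the other engines.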
This bug is still present with the latest configuration on BDW-U.

Environment:
xserver checkout xorg-server-1.17.2
drm checkout libdrm-2.4.65
xf86-video-intel checkout 2.99.917
mesa checkout mesa-11.0.4
libva checkout libva-1.6.1
intel-driver checkout 1.6.1
cairo checkout 1.14.2

Broadwell-U
Hardware
Platform: Lenovo G50
Processor: Intel Core I5-5200 2.20 GHz

Software
Linux distribution: Ubuntu 14.04.03 LTS 64 bits
BIOS: B0CN69WW
Kernel: http://vanaheimr.fr.intel.com/shared/out/kernels/drm-intel-testing/WW44_4.3.0-rc6_8707465/

The bisection is a red herring. The issue is a race in the checking of atomic_t reset_counter that for whatever reason appears to be provoked by execlists. Note that your dmesg does not include the culprit.
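The comment above does not say where in the driver the race sits, so the following is only a generic illustration of how a check against a reset counter can go wrong. It is not taken from i915. `ResetCounter` and both helper functions are hypothetical, a plain Python int stands in for `atomic_t`, and the `wait` callback simulates a reset arriving while the caller sleeps. The point it shows: the counter has to be sampled before the wait and compared afterwards. If it is sampled too late, a reset that completed in between goes unnoticed.

```python
from typing import Callable


class ResetCounter:
    """Counts completed GPU resets; a plain int stands in for atomic_t."""

    def __init__(self) -> None:
        self.value = 0

    def read(self) -> int:
        return self.value

    def bump(self) -> None:
        self.value += 1


def reset_seen_early_sample(counter: ResetCounter, wait: Callable[[], None]) -> bool:
    """Sample before waiting, compare afterwards: detects a reset during wait()."""
    snapshot = counter.read()
    wait()  # a reset may be processed while the caller sleeps here
    return counter.read() != snapshot


def reset_seen_late_sample(counter: ResetCounter, wait: Callable[[], None]) -> bool:
    """Sample only after waiting: a reset during wait() goes unnoticed."""
    wait()
    snapshot = counter.read()  # too late, this value already includes the reset
    return counter.read() != snapshot
```

Passing `counter.bump` as `wait` simulates a reset that lands in the window between the two steps. Only the early-sample variant reports it.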
This issue is still present on the latest drm and nightly kernels with the following configuration:

Software configuration:
--------------------------------
Ubuntu 14.04.03 x86_64
Xserver: 1.17.4 (commit: 2c7fa2a)
libdrm: 2.4.65 (commit: c349616)
Xf86-video-intel: 2.99.917 (commit: baec802)
Mesa: 11.0.4 (commit: 31bf247)
Libva: 1.6.1 (commit: 613eb96)
Intel-driver: 1.6.1 (commit: 35858c6)
Cairo: 1.14.4 (commit: 0317ee7)
Intel-GPU-Tools: 1.12 (commit: 1f9e055)
BIOS: 5.6

Kernel: latest drm-intel-testing (4.3.0-rc6-testing)
commit 87074657f22e38163e712ca417e1a398d00096b6
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Fri Oct 23 11:56:52 2015 +0200

test: gem_reloc_vs_gpu / forked-faulting-reloc-thrashing-hang BDW-U = fail
test: gem_reloc_vs_gpu / forked-interruptible-thrashing-hang BDW-U = fail
test: gem_reloc_vs_gpu / forked-thrashing-hang BDW-U =

Kernel: latest drm-intel-nightly: 2015y-11m-06d-12h-48m-02s UTC integration manifest
commit a3b0dec82fdb59c629c4fb9847245b80b0cf69dd
Author: Jani Nikula <jani.nikula@intel.com>
Date: Fri Nov 6 14:48:23 2015 +0200

test: gem_reloc_vs_gpu / forked-faulting-reloc-thrashing-hang BDW-U = fail
test: gem_reloc_vs_gpu / forked-interruptible-thrashing-hang BDW-U = fail
test: gem_reloc_vs_gpu / forked-thrashing-hang BDW-U = fail

Note: The tests never finish; they run for more than 10 minutes. dmesg and GPU_crash_dump are attached.

Created attachment 119545 [details]
dmesg-bdw
Created attachment 119546 [details]
GPU_crash_dump_bdw
All the mentioned tests are being skipped, even though we are running them over 2 pipes. Sharing the configuration:

++ Kernel version : 4.4.4-040404-generic
++ Linux distribution : Ubuntu 15.10
++ Architecture : 64-bit
++ xf86-video-intel version : 2.99.917
++ Xorg-Xserver version : 1.17.2
++ DRM version : 2.4.64
++ VAAPI version : Intel i965 driver for Intel(R) Broadwell - 1.6.0
++ Cairo version : 1.14.2
++ Intel GPU Tools version : Tag [intel-gpu-tools-1.14-74-g431f6c4] / Commit [431f6c4]
++ Kernel driver in use : i915
++ Bios revision : 5.6

--- Hardware information ---
++ Platform :
++ Motherboard model :
++ Motherboard type : NUC5i7RYB Desktop
++ Motherboard manufacturer :
++ CPU family : Core i7
++ CPU information : Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
++ GPU Card : Intel Corporation Broadwell-U Integrated Graphics (rev 09) (prog-if 00 [VGA controller])
++ Memory ram : 8 GB
++ Maximum memory ram allowed : 16 GB
++ Display resolution :
++ CPU's number : 4
++ Hard drive capacity : 120 GB

Please disregard the last status; the tests are still failing with the mentioned configuration.

commit 821ed7df6e2a1dbae243caebcfe21a0a4329fca0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Sep 9 14:11:53 2016 +0100

    drm/i915: Update reset path to fix incomplete requests

    Update reset path in preparation for engine reset which requires identification of incomplete requests and associated context and fixing their state so that engine can resume correctly after reset. The request that caused the hang will be skipped and head is reset to the start of breadcrumb. This allows us to resume from where we left-off. Since this request didn't complete normally we also need to cleanup elsp queue manually. This is vital if we employ nonblocking request submission where we may have a web of dependencies upon the hung request and so advancing the seqno manually is no longer trivial.