90732 – [BDW/BSW Bisected]igt/gem_reloc_vs_gpu/forked-faulting-reloc-thrashing-hang causes GPU reset fail

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	All Linux (All)

Importance:	high major
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

Reported:	2015-05-29 02:19 UTC by lu hua
Modified:	2016-09-12 09:15 UTC
CC List:	2 users

i915 platform:	BDW, BSW/CHT
i915 features:	GEM/Other

Attachments
error state (2.79 MB, text/plain) 2015-05-29 02:19 UTC, lu hua
dmesg (124.71 KB, text/plain) 2015-05-29 02:20 UTC, lu hua
output (6.74 KB, text/plain) 2015-05-29 02:20 UTC, lu hua
dmesg-bdw (216.75 KB, text/plain) 2015-11-10 19:07 UTC, Humberto Israel Perez Rodriguez
GPU_crash_dump_bdw (464.41 KB, text/plain) 2015-11-10 19:08 UTC, Humberto Israel Perez Rodriguez

Description lu hua 2015-05-29 02:19:26 UTC

Created attachment 116131
error state

==System Environment==
--------------------------
Regression: yes

good commit:  65de797816eadb227c45b0127d7ff92410fa3814(dinq)
bad commit: 99c044d7d5cc65661436f271754c011d0f1a02de(dinq)

Non-working platforms: BDW/BSW

==kernel==
--------------------------
drm-intel-nightly/b44f6771cba2cc90525d037445330ed766377aa9
commit b44f6771cba2cc90525d037445330ed766377aa9
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu May 28 13:39:29 2015 +0200

    drm-intel-nightly: 2015y-05m-28d-11h-38m-51s UTC integration manifest


==Bug detailed description==
-----------------------------
Running ./gem_reloc_vs_gpu --run-subtest forked-faulting-reloc-thrashing-hang causes the GPU reset to fail.
The following subtests show the same issue:
igt@gem_reloc_vs_gpu@forked-interruptible-thrashing-hang
igt@gem_reloc_vs_gpu@forked-thrashing-hang

dmesg:
[   91.753899] [drm] stuck on blitter ring
[   91.754663] [drm] GPU HANG: ecode 8:2:0xe77ffff2, in gem_reloc_vs_gp [4986], reason: Ring hung, action: reset
[   91.754665] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   91.754666] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   91.754668] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   91.754669] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   91.754670] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   91.754705] [drm:i915_reset_and_wakeup] resetting chip
[  101.748383] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[  101.748413] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[  101.748442] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[  101.748477] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[  101.748500] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[  101.748525] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[  101.748547] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[  101.748570] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[  101.748617] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[  101.750656] Setting dangerous option prefault_disable - tainting kernel
[  101.751194] Setting dangerous option prefault_disable - tainting kernel
[  101.751291] Setting dangerous option prefault_disable - tainting kernel

[  240.060726] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  240.060767] kworker/u16:3   D ffff8800a7c77aa8     0  1237      2 0x00000000
[  240.060797] Workqueue: i915-hangcheck i915_hangcheck_elapsed [i915]
[  240.060799]  ffff8800a7c77aa8 ffff880002ae0000 ffff8800a7f82120 ffff8800a7c77ad8
[  240.060802]  0000000000000246 0000000000000000 ffff8800a7c78000 0000000000000246
[  240.060805]  0000000000000000 ffff88000355c068 ffff8800a7f82120 ffff8800a7c77ac8
[  240.060808] Call Trace:
[  240.060814]  [<ffffffff81896db4>] schedule+0x75/0x84
[  240.060816]  [<ffffffff81897011>] schedule_preempt_disabled+0xe/0x10
[  240.060818]  [<ffffffff818986c5>] mutex_lock_nested+0x17c/0x2cb
[  240.060833]  [<ffffffffa0094a13>] ? i915_reset+0x3a/0x13e [i915]
[  240.060847]  [<ffffffffa0094a13>] i915_reset+0x3a/0x13e [i915]
[  240.060866]  [<ffffffffa00c80e2>] i915_reset_and_wakeup+0xd3/0x133 [i915]
[  240.060885]  [<ffffffffa00cbd51>] i915_handle_error+0x5ab/0x5bd [i915]
[  240.060905]  [<ffffffffa00dda30>] ? gen6_read32+0x11a/0x18b [i915]
[  240.060910]  [<ffffffff8109352f>] ? vprintk_default+0x1d/0x1f
[  240.060913]  [<ffffffff8188f3e9>] ? printk+0x46/0x48
[  240.060930]  [<ffffffffa00cc14f>] i915_hangcheck_elapsed+0x3a3/0x3c3 [i915]
[  240.060933]  [<ffffffff8105ab88>] ? process_one_work+0x1ba/0x409
[  240.060935]  [<ffffffff8105abf3>] process_one_work+0x225/0x409
[  240.060937]  [<ffffffff8105ab74>] ? process_one_work+0x1a6/0x409
[  240.060940]  [<ffffffff8105b694>] worker_thread+0x275/0x369
[  240.060942]  [<ffffffff8107c63a>] ? complete+0x42/0x4a
[  240.060944]  [<ffffffff8105b41f>] ? cancel_delayed_work_sync+0x15/0x15
[  240.060947]  [<ffffffff81060039>] kthread+0xf6/0xfe
[  240.060950]  [<ffffffff8105ff43>] ? kthread_create_on_node+0x1ac/0x1ac
[  240.060953]  [<ffffffff8189b892>] ret_from_fork+0x42/0x70
[  240.060955]  [<ffffffff8105ff43>] ? kthread_create_on_node+0x1ac/0x1ac
[  240.060957] INFO: lockdep is turned off.
[  240.060966] INFO: task gem_reloc_vs_gp:4986 blocked for more than 120 seconds.
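
In the log above, roughly ten seconds pass between the "resetting chip" message and the first "Timed out waiting for the gpu reset to complete" error. The following is a minimal Python sketch for pulling that interval and the number of timeout errors out of a saved dmesg. The function name summarize_reset and the embedded sample are illustrative assumptions; this is not part of the i915 driver or intel-gpu-tools.

```python
import re

# Matches the kernel timestamp prefix, e.g. "[   91.754705] message".
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")

TIMEOUT_MSG = "Timed out waiting for the gpu reset to complete"


def summarize_reset(dmesg_text):
    """Return (reset_start, first_timeout, timeout_count) from dmesg text.

    Hypothetical helper: timestamps are seconds since boot, or None if the
    corresponding message never appears.
    """
    reset_start = None
    first_timeout = None
    timeout_count = 0
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines without a timestamp prefix
        stamp, message = float(match.group(1)), match.group(2)
        if "resetting chip" in message and reset_start is None:
            reset_start = stamp
        elif TIMEOUT_MSG in message:
            timeout_count += 1
            if first_timeout is None:
                first_timeout = stamp
    return reset_start, first_timeout, timeout_count


# Three lines copied from the dmesg in this report.
SAMPLE = """\
[   91.754705] [drm:i915_reset_and_wakeup] resetting chip
[  101.748383] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
[  101.748413] [drm:i915_gem_wait_for_error.part.25 [i915]] *ERROR* Timed out waiting for the gpu reset to complete
"""

start, first, count = summarize_reset(SAMPLE)
print(f"reset started at {start:.6f}s, "
      f"first timeout {first - start:.3f}s later, {count} timeout errors")
```

To use it on a full log, pass the contents of the attached dmesg file to summarize_reset instead of SAMPLE.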

==Reproduce steps==
---------------------------- 
1.  ./gem_reloc_vs_gpu --run-subtest forked-faulting-reloc-thrashing-hang
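
A small Python sketch for running the three affected subtests in sequence with a time limit. The binary path, the 600 second limit, and the helper names are assumptions for illustration; treating every nonzero exit status as a failure is a simplification.

```python
import subprocess

# Subtests named in this report.
SUBTESTS = [
    "forked-faulting-reloc-thrashing-hang",
    "forked-interruptible-thrashing-hang",
    "forked-thrashing-hang",
]


def build_command(subtest, binary="./gem_reloc_vs_gpu"):
    """Build the argument list used in the reproduce step."""
    return [binary, "--run-subtest", subtest]


def run_subtest(subtest, timeout_s=600, binary="./gem_reloc_vs_gpu"):
    """Run one subtest and return 'pass', 'fail', or 'timeout'.

    timeout_s is an arbitrary limit chosen for this sketch.
    """
    try:
        result = subprocess.run(build_command(subtest, binary),
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    return "pass" if result.returncode == 0 else "fail"


def main():
    """Run every affected subtest and print one result line per subtest."""
    for name in SUBTESTS:
        print(f"{name}: {run_subtest(name)}")
```

Call main() from the directory that contains the test binary. If the test process is stuck in uninterruptible sleep, as the hung task message in the dmesg suggests, the kill issued on timeout may not take effect and the runner itself can block.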

Comment 1 lu hua 2015-05-29 02:20:01 UTC

Created attachment 116132
dmesg

Comment 2 lu hua 2015-05-29 02:20:22 UTC

Created attachment 116133
output

Comment 3 Ander Conselvan de Oliveira 2015-05-29 06:24:47 UTC

Please bisect.

Comment 4 lu hua 2015-06-23 08:40:29 UTC

Bisect shows:
The first bad commit could be any of:
b47161858ba13c9c7e03333132230d66e008dd55
03ade51185596a1d1028531c78fda557f244d676
We cannot bisect more!

commit 03ade51185596a1d1028531c78fda557f244d676
Author:     Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Mon Apr 27 13:41:18 2015 +0100
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Thu May 21 15:11:43 2015 +0200

    drm/i915: Inline check required for object syncing prior to execbuf

    This trims a little overhead from the common case of not needing to
    synchronize between rings.

    v2: execlists is special and likes to duplicate code.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

commit b47161858ba13c9c7e03333132230d66e008dd55
Author:     Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Mon Apr 27 13:41:17 2015 +0100
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Thu May 21 15:11:42 2015 +0200

    drm/i915: Implement inter-engine read-read optimisations

    Currently, we only track the last request globally across all engines.
    This prevents us from issuing concurrent read requests on e.g. the RCS
    and BCS engines (or more likely the render and media engines). Without
    semaphores, we incur costly stalls as we synchronise between rings -
    greatly impacting the current performance of Broadwell versus Haswell in
    certain workloads (like video decode). With the introduction of
    reference counted requests, it is much easier to track the last request
    per ring, as well as the last global write request so that we can
    optimise inter-engine read read requests (as well as better optimise
    certain CPU waits).
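
The tracking scheme described in the commit message above (last request per ring plus the last write request) can be illustrated with a toy model. This is only a sketch of the idea, not the i915 implementation: the class, its method names, and the (engine, seqno) request shape are hypothetical, and it assumes requests on a single engine execute in order.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# (engine name, sequence number); a hypothetical stand-in for a request.
Request = Tuple[str, int]


@dataclass
class TrackedObject:
    """Toy model of per-engine read tracking for one buffer object."""

    last_read: Dict[str, Request] = field(default_factory=dict)
    last_write: Optional[Request] = None

    def must_wait_for(self, engine: str, write: bool) -> List[Request]:
        """Requests on other engines that a new access has to wait for."""
        if write:
            # A writer waits for the last reader on every other engine.
            # A write is also recorded as a read, so this covers last_write.
            return sorted(r for e, r in self.last_read.items() if e != engine)
        # A reader only waits for the last writer, and only across engines.
        if self.last_write is not None and self.last_write[0] != engine:
            return [self.last_write]
        return []

    def record(self, request: Request, write: bool) -> None:
        """Remember the access once its dependencies have been queued."""
        engine = request[0]
        if write:
            # Earlier readers were waited for, so the write supersedes them.
            self.last_read = {engine: request}
            self.last_write = request
        else:
            self.last_read[engine] = request


obj = TrackedObject()
obj.record(("render", 1), write=True)
blit_deps = obj.must_wait_for("blitter", write=False)   # waits for the write
obj.record(("blitter", 2), write=False)
media_deps = obj.must_wait_for("media", write=False)    # ignores blitter read
obj.record(("media", 3), write=False)
write_deps = obj.must_wait_for("render", write=True)    # waits for both reads
```

With only a single globally tracked last request, the media read in this example would have had to wait for the blitter read as well; tracking reads per engine is what lets the two reads proceed concurrently.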

Comment 5 cprigent 2015-10-08 16:51:22 UTC

Bug scrub:
Elio, could you check whether this still reproduces?

Comment 6 Elio 2015-10-27 19:45:19 UTC

This bug is still present with the latest configuration on BDW-U.

Environment:

xserver: checkout xorg-server-1.17.2
drm: checkout libdrm-2.4.65
xf86-video-intel: checkout 2.99.917
mesa: checkout mesa-11.0.4
libva: checkout libva-1.6.1
intel-driver: checkout 1.6.1
cairo: checkout 1.14.2

Broadwell-U
Hardware
Platform: Lenovo G50
Processor: Intel Core I5-5200 2.20 GHz
Software
Linux distribution: Ubuntu 14.04.03 LTS 64 bits
BIOS: B0CN69WW

Comment 7 Elio 2015-10-27 19:58:06 UTC

This bug is still present with the latest configuration on BDW-U.

Environment:

xserver: checkout xorg-server-1.17.2
drm: checkout libdrm-2.4.65
xf86-video-intel: checkout 2.99.917
mesa: checkout mesa-11.0.4
libva: checkout libva-1.6.1
intel-driver: checkout 1.6.1
cairo: checkout 1.14.2

Broadwell-U
Hardware
Platform: Lenovo G50
Processor: Intel Core I5-5200 2.20 GHz
Software
Linux distribution: Ubuntu 14.04.03 LTS 64 bits
BIOS: B0CN69WW

Kernel
http://vanaheimr.fr.intel.com/shared/out/kernels/drm-intel-testing/WW44_4.3.0-rc6_8707465/

Comment 8 Chris Wilson 2015-10-28 14:31:53 UTC

The bisection is a red herring. The issue is a race in the checking of atomic_t reset_counter that for whatever reason appears to be provoked by execlists. Note that your dmesg does not include the culprit.

Comment 9 Humberto Israel Perez Rodriguez 2015-11-10 19:06:33 UTC

This issue is still present on the latest drm and nightly kernels with the following configuration:

Software configuration :
--------------------------------
Ubuntu 14.04.03 x86_64
Xserver : 1.17.4  (commit : 2c7fa2a)
libdrm : 2.4.65 (commit :c349616)
Xf86-video-intel : 2.99.917 (commit : baec802)
Mesa : 11.0.4 (commit : 31bf247)
Libva : 1.6.1 (commit : 613eb96)
Intel-driver : 1.6.1 (commit : 35858c6)
Cairo : 1.14.4 (commit : 0317ee7)
Intel-GPU-Tools : 1.12 (commit : 1f9e055)
BIOS : 5.6


Kernel : latest drm-intel-testing (4.3.0-rc6-testing)
commit 87074657f22e38163e712ca417e1a398d00096b6
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Fri Oct 23 11:56:52 2015 +0200

test : gem_reloc_vs_gpu / forked-faulting-reloc-thrashing-hang
BDW-U = fail

test : gem_reloc_vs_gpu / forked-interruptible-thrashing-hang
BDW-U = fail

test : gem_reloc_vs_gpu / forked-thrashing-hang
BDW-U = 


Kernel : latest drm-intel-nightly: 2015y-11m-06d-12h-48m-02s UTC integration manifest
commit a3b0dec82fdb59c629c4fb9847245b80b0cf69dd
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Fri Nov 6 14:48:23 2015 +0200

test : gem_reloc_vs_gpu / forked-faulting-reloc-thrashing-hang
BDW-U = fail

test : gem_reloc_vs_gpu / forked-interruptible-thrashing-hang
BDW-U = fail

test : gem_reloc_vs_gpu / forked-thrashing-hang
BDW-U = fail

Note: The tests never finish, even after more than 10 minutes. The dmesg and GPU_crash_dump are attached.

Comment 10 Humberto Israel Perez Rodriguez 2015-11-10 19:07:06 UTC

Created attachment 119545
dmesg-bdw

Comment 11 Humberto Israel Perez Rodriguez 2015-11-10 19:08:13 UTC

Created attachment 119546
GPU_crash_dump_bdw

Comment 12 Elio 2016-03-18 17:13:59 UTC

All of the mentioned tests are being skipped, even though we are running them over 2 pipes. Configuration:

 ++ Kernel version                      : 4.4.4-040404-generic
 ++ Linux distribution                  : Ubuntu 15.10
 ++ Architecture                        : 64-bit
 
 ++ xf86-video-intel version            : 2.99.917
 ++ Xorg-Xserver version                : 1.17.2
 ++ DRM version                         : 2.4.64
 ++ VAAPI version                       : Intel i965 driver for Intel(R) Broadwell - 1.6.0
 ++ Cairo version                       : 1.14.2
 ++ Intel GPU Tools version             : Tag [intel-gpu-tools-1.14-74-g431f6c4] / Commit [431f6c4]
 ++ Kernel driver in use                : i915
 ++ Bios revision                       : 5.6


 --- Hardware information ---

 ++ Platform                            :
 ++ Motherboard model                   :
 ++ Motherboard type                    : NUC5i7RYB Desktop
 ++ Motherboard manufacturer            :
 ++ CPU family                          : Core i7
 ++ CPU information                     : Intel(R) Core(TM) i7-5557U CPU @ 3.10GHz
 ++ GPU Card                            : Intel Corporation Broadwell-U Integrated Graphics (rev 09) (prog-if 00 [VGA controller])
 ++ Memory ram                          : 8 GB
 ++ Maximum memory ram allowed          : 16 GB
 ++ Display resolution                  :
 ++ CPU's number                        : 4
 ++ Hard drive capacity                 : 120 GB

Comment 13 Elio 2016-03-18 17:20:02 UTC

Please disregard my last comment; the tests are still failing with the mentioned configuration.

Comment 14 Chris Wilson 2016-09-09 17:52:02 UTC

commit 821ed7df6e2a1dbae243caebcfe21a0a4329fca0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Sep 9 14:11:53 2016 +0100

    drm/i915: Update reset path to fix incomplete requests
    
    Update reset path in preparation for engine reset which requires
    identification of incomplete requests and associated context and fixing
    their state so that engine can resume correctly after reset.
    
    The request that caused the hang will be skipped and head is reset to the
    start of breadcrumb. This allows us to resume from where we left-off.
    Since this request didn't complete normally we also need to cleanup elsp
    queue manually. This is vital if we employ nonblocking request
    submission where we may have a web of dependencies upon the hung request
    and so advancing the seqno manually is no longer trivial.
