Bug 89959

Summary:

[SKL]igt/gem_concurrent_blit subcase sporadically causes system hang and *ERROR* blitter: timed out waiting for forcewake ack to clear.

Product:

DRI

Reporter:

lu hua <huax.lu>

Component:

DRM/Intel

Assignee:

Joonas Lahtinen <joonas.lahtinen>

Status:

CLOSED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

critical

Priority:

highest

CC:

christophe.prigent, eero.t.tamminen, gary.c.wang, gavin.hindman, intel-gfx-bugs, james.ausmus, joe.konno, patrik.r.jakobsson, tjaalton, ypwong

Version:

unspecified

Hardware:

All

OS:

Linux (All)

Whiteboard:

i915 platform:

BXT, SKL

i915 features:

GEM/Other, GPU hang

Attachments:

Description	Flags
dmesg	none
dmesg(netconsole)	none
drm/i915: Reset request handling for gen9+	none
error state with gem_concurrent_blit	none
log test	none
drm/i915: Contain the WA_REG macro	none
drm/i915/gen9: Disable gather at set shader bit	none

Description lu hua 2015-04-09 02:17:13 UTC

Created attachment 114969 [details]
dmesg

==System Environment==
--------------------------
Regression: not sure, We disabled it in old test.

Non-working platforms: SKL

==kernel==
--------------------------
drm-intel-nightly/333cf6eed7cae0ee2d6bb1b4c4d421b94f84fc13
commit 333cf6eed7cae0ee2d6bb1b4c4d421b94f84fc13
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Thu Apr 2 11:38:12 2015 +0300

    drm-intel-nightly: 2015y-04m-02d-08h-37m-57s UTC integration manifest

==Bug detailed description==
-----------------------------
It sporadically causes system hang and *ERROR* blitter: timed out waiting for forcewake ack to clear. Fail rate: 1/3.

root@x-skly05:/GFX/Test/Intel_gpu_tools/intel-gpu-tools/tests# time ./gem_concurrent_blit --run-subtest gtt-blt-overwrite-source-read-rcs-forked
IGT-Version: 1.10-ga6c3b32 (x86_64) (Linux: 4.0.0-rc6_drm-intel-nightly_333cf6_20150403+ x86_64)
using 2x512 buffers, each 1MiB
Subtest gtt-blt-overwrite-source-read-rcs-forked: SUCCESS (6.178s)

real    0m6.607s
user    0m8.217s
sys     0m9.462s
root@x-skly05:/GFX/Test/Intel_gpu_tools/intel-gpu-tools/tests# time ./gem_concurrent_blit --run-subtest gtt-blt-overwrite-source-read-rcs-forked
IGT-Version: 1.10-ga6c3b32 (x86_64) (Linux: 4.0.0-rc6_drm-intel-nightly_333cf6_20150403+ x86_64)
using 2x512 buffers, each 1MiB
Subtest gtt-blt-overwrite-source-read-rcs-forked: SUCCESS (6.384s)

real    0m6.599s
user    0m8.126s
sys     0m9.998s
root@x-skly05:/GFX/Test/Intel_gpu_tools/intel-gpu-tools/tests# time ./gem_concurrent_blit --run-subtest gtt-blt-overwrite-source-read-rcs-forked
IGT-Version: 1.10-ga6c3b32 (x86_64) (Linux: 4.0.0-rc6_drm-intel-nightly_333cf6_20150403+ x86_64)
using 2x512 buffers, each 1MiB

==Reproduce steps==
---------------------------- 
1. time ./gem_concurrent_blit --run-subtest gtt-blt-overwrite-source-read-rcs-forked

Comment 1 Ander Conselvan de Oliveira 2015-05-05 09:23:28 UTC

The dmesg output is truncated. Can you produce another one with the logs leading to the issue?

Comment 2 lu hua 2015-05-06 07:08:20 UTC

(In reply to Ander Conselvan de Oliveira from comment #1)
> The dmesg output is truncated. Can you produce another one with the logs
> leading to the issue?

I add drm.debug=0xe, console=tty0 and console=ttyS0,9600, get the dmesg. I retest it with these 3 parameters, get similar dmesg.
Do you have any parameter can print dmesg as you expect?

Comment 3 Ander Conselvan de Oliveira 2015-05-07 07:19:41 UTC

(In reply to lu hua from comment #2)
> (In reply to Ander Conselvan de Oliveira from comment #1)
> > The dmesg output is truncated. Can you produce another one with the logs
> > leading to the issue?
> 
> I add drm.debug=0xe, console=tty0 and console=ttyS0,9600, get the dmesg. I
> retest it with these 3 parameters, get similar dmesg.
> Do you have any parameter can print dmesg as you expect?

Add loglevel=7 to the kernel command line.

Comment 4 lu hua 2015-05-07 08:01:04 UTC

Created attachment 115618 [details]
dmesg(netconsole)

update dmesg.

Comment 5 wendy.wang 2015-05-12 06:51:54 UTC

Hello Ander,
Would you pls help take a look at the dmesg captured from netconsole?
Thanks.

Comment 6 Mika Kuoppala 2015-05-18 16:53:23 UTC

Can you reproduce similar problem with forcewake with enough runs of
gem_render_copy and gem_render_copy_redux?

Try: 'while ./gem_render_copy ; do date ; done'

Comment 7 ye.tian 2015-05-19 06:06:49 UTC

(In reply to Mika Kuoppala from comment #6)
> Can you reproduce similar problem with forcewake with enough runs of
> gem_render_copy and gem_render_copy_redux?
> 
> Try: 'while ./gem_render_copy ; do date ; done'

I can not reproduce similar problem.

Comment 8 Mika Kuoppala 2015-05-21 13:20:13 UTC

(In reply to ye.tian from comment #7)
> (In reply to Mika Kuoppala from comment #6)
> > Can you reproduce similar problem with forcewake with enough runs of
> > gem_render_copy and gem_render_copy_redux?
> > 
> > Try: 'while ./gem_render_copy ; do date ; done'
> 
> I can not reproduce similar problem.

There is evidence that the forcewake ack clearing error is a symptom of
accessing the blitter ring post reset.

That indicates that our reinitialization after reset might be incomplete on skl.

If 'cat /sys/kernel/debug/dri/0/i915_wedged' is not zero when forcewake ack
errors start to mount, that would support the incomplete initialization hypothesis.

Comment 9 lu hua 2015-05-22 06:52:40 UTC

Test on the latest drm-intel-nightly kernel,It still exists.

Comment 10 Gavin Hindman 2015-06-11 18:33:43 UTC

Any update on this issue?

Comment 11 Mika Kuoppala 2015-06-12 14:42:18 UTC

(In reply to Gavin Hindman from comment #10)
> Any update on this issue?

Update:

The problem is that after a GPU hang caused by this test, the gpu ends up in a state that writing to a reset register is an immediate system hang.

No root cause nor workaround yet.

Comment 12 Joonas Lahtinen 2015-06-15 13:26:58 UTC

This was earlier masked by an another bug (possibly e1000e memory corrupt one) bufmgr would initialize properly only at random occasions, so the testcase could not be run succesfully at all. Now that the bufmgr initializes properly so able to proceed.

The gem_render_copy test runs for several thousand iterations without crashing. I'm getting hangs with gem_concurrent_blit now, but they're different than the earlier reported ones;

[  681.882688] gem_concurrent_blit: starting subtest gtt-render-overwrite-source-forked
[  682.869957] [drm] RC6 on
[  686.869348] [drm] stuck on render ring
[  686.873886] [drm] GPU HANG: ecode 9:0:0x85dffffb, in gem_concurrent_ [12313], reason: Ring hung, action: reset
[  686.887309] drm/i915: Resetting chip after gpu hang
[  686.904592] gem_concurrent_blit: starting subtest gtt-render-overwrite-source-read-bcs-forked

The error reports do not seem very coherent with single bug, I continue looking at the problem I get.

Comment 13 Mika Kuoppala 2015-06-16 13:40:20 UTC

Created attachment 116532 [details] [review]
drm/i915: Reset request handling for gen9+

Comment 14 Mika Kuoppala 2015-06-16 14:06:48 UTC

(In reply to Mika Kuoppala from comment #13)
> Created attachment 116532 [details] [review] [review]
> drm/i915: Reset request handling for gen9+

Test will hang the gpu and fail but this patch should prevent
the system hang.

Comment 15 ye.tian 2015-06-17 02:35:56 UTC

Test result is the same as [Comment 14].

Comment 16 Gavin Hindman 2015-06-29 03:28:53 UTC

Where are we on this issue?

Comment 17 Timo Aaltonen 2015-07-09 07:01:02 UTC

I was able to reproduce this on a rev04 SKL-H by running xonotic demo on a loop. kernel ~4.2rc1, mesa 10.6.1

Comment 18 Mika Kuoppala 2015-07-16 16:11:51 UTC

Created attachment 117175 [details]
error state with gem_concurrent_blit

Comment 19 Mika Kuoppala 2015-07-16 16:29:24 UTC

Fast way to reproduce is:

gem_concurrent_blit --run-subtest prw-blt-overwrite-source-read-rcs-forked

The forcewake ack to clear error messages are misleading as they result from
execlist interrupt handler accessing already dead gpu. 

These items are _less_ likely to be the root cause as I have disabled them
selectively without impact on occurrence.

- rc6
- dynamic page tables (reproducible with static pdps and aliasing ppgtt)
- execlist scheduling/pre-emption (reproducible with 1 slot entry submission)
- per ctx bb / pre-emption disabling (Arun's patches are already in)


So in the probable list remains:

- flush problem (tlb). Forced flushing per ring/bb didn't have impact.
- context save/restore corruption.
- missing workaround

Comment 20 Rami 2015-08-06 17:01:26 UTC

Created attachment 117569 [details]
log test

Test passed with success.

IGT-Version: 1.11-g6bd42ce (x86_64) (Linux: 4.2.0-rc5-drm-intel-nightly-ww32-patch-bug89739+ x86_64)
using 2x512 buffers, each 1MiB
Subtest prw-blt-overwrite-source-read-rcs-forked: SUCCESS (8.634s)


Hardware
Platform: SKY LAKE Y A0
CPU : Intel(R) Core(TM) m3-6Y30 CPU @ 0.8GHz 4MB (family: 6, model: 78  stepping: 3)
MCP : SKL-Y  D1  2+2
QDF : QVY3 
CPU : SKL D1
Chipset PCH: Sunrise Point LP C1       
CRB : SKY LAKE Y LPDDR3 RVP3 CRB FAB2
Reworks : All Mandatories + FBS02 & FBS03, O-06
Software 
Kernel : drm-intel-nightly fb4572c00fadc1ac94816061e76c65b65607f66a 4.2.0-rc5 from git://anongit.freedesktop.org/drm-intel
Bios: SKLSE2R1.R00.X093.1507222151
ME FW : 11.0.0.1165
Ksc (EC FW): 1.16
drm: (HEAD, origin/master, origin/HEAD, master) fc083322b0c8a58b51976adf23a582bce8bb75f1 from git://git.freedesktop.org/git/mesa/drm
intel-driver: (HEAD, origin/master, origin/HEAD, master) 611d8ea9d75dc026c203e3ebe53b434769d4587c from git://git.freedesktop.org/git/vaapi/intel-driver
libva: (HEAD, origin/master, origin/HEAD, master) 70b80c0dd2effb4956b208775641f7c68a67a9df from git://git.freedesktop.org/git/vaapi/libva
mesa: (HEAD, origin/master, origin/HEAD, master) 1b2b0e42ce47bfd1fcb5513ed2c23b9bb7a5a5b8 from git://git.freedesktop.org/git/mesa/mesa
xf86-video-intel: (HEAD, origin/master, origin/HEAD, master) 4246c63347290390a2104739c719f5ff6a05a0e2 from git://git.freedesktop.org/git/xorg/driver/xf86-video-intel
xserver: (HEAD, origin/master, origin/HEAD, master) ea03e314f98e5d8ed7bf7a508006a3d84014bde5 from git://git.freedesktop.org/git/xorg/xserver


Kernel commit log:
commit fb4572c00fadc1ac94816061e76c65b65607f66a
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Aug 5 17:33:52 2015 +0200
drm-intel-nightly: 2015y-08m-05d-15h-33m-02s UTC integration manifest

Comment 21 Joonas Lahtinen 2015-08-07 12:44:59 UTC

This has been resolved for SKL-Y with the commit;

http://cgit.freedesktop.org/drm-intel/commit/?id=245d96670d2655f70f4445d5247f26afbe705c84

We're still working intensively for resolving it for other SKL platforms.

Comment 22 Joonas Lahtinen 2015-08-07 12:48:18 UTC

Adding GPU hang tag, as this error message is presumably caused by the hanging GPU, and is seen on BXT too.

Comment 23 Mika Kuoppala 2015-08-11 14:53:04 UTC

Created attachment 117628 [details] [review]
drm/i915: Contain the WA_REG macro

Comment 24 Mika Kuoppala 2015-08-11 14:53:56 UTC

Created attachment 117629 [details] [review]
drm/i915/gen9: Disable gather at set shader bit

Comment 25 Mika Kuoppala 2015-08-11 14:55:45 UTC

Please test with both 
drm/i915: Contain the WA_REG macro
drm/i915/gen9: Disable gather at set shader bit
patches applied.

Comment 26 Timo Aaltonen 2015-08-11 20:29:47 UTC

the new patches work fine on my SKL-Y SDP

Comment 27 dog 2015-08-12 13:59:40 UTC

Note that the customer who is impacted by this bug is using platforms SKL-S and SKL-H.  Does anyone have these platforms so the latest patches can be tested?

Also, the bug status is NEEDINFO.  What additional info is needed at this point?

Comment 28 Timo Aaltonen 2015-08-12 14:12:48 UTC

I do, and testing the various patches on nightly to see what changes

Comment 29 Joonas Lahtinen 2015-08-13 13:58:46 UTC

We're awaiting for QA to do a test run on all variants, but are quite convinced that this specific bug is resolved. The WebGL hang is most likely an entirely different bug, and should be reported separately with reproduction instructions.

Fixing the status to assigned, we're just waiting for confirmation that everyone is having these gem_concurrent_blit subtests working.

Comment 30 Mika Kuoppala 2015-08-13 14:43:36 UTC

commit dd82494724c1c11ceeeaac66a2ed0113ec13f8e4
Author: Arun Siluvery <arun.siluvery@linux.intel.com>
Date:   Wed Aug 12 12:26:01 2015 +0100

    lib/rendercopy_gen9: Setup Push constant pointer before sending BTP commands


Please retest gem_concurrent_blit with latest drm-intel-nightly on igt/master on skl variants you have.

Comment 31 cprigent 2015-08-14 09:14:40 UTC

Tested on SKL-S: test seems to never end. CPU usage is 100%. We are able to end the test with ctrl-z, no crash.
Test gem_render_copy executed just after is pass.

SKY LAKE-S1 (QA Lab) 
Hardware
Platform: SKY LAKE DT Pre Alpha
CPU : Intel(R) CPU  @ 2.3GHz (family: 6, model: 94  stepping: 0)
SoC : SKL-S (DT) 4+2 P0
Chipset PCH: Sunrise Point A0               
CRB : SKY LAKE DT DDR4 RVP08 CRB FAB1
Software 
Linux : Ubuntu 14.04 LTS 64 bits
Kernel : drm-intel-nightly cbad75dd043b82a0684e94fe3b5f994fe0c3414b  (4.2.0-rc5) from git://anongit.freedesktop.org/drm-intel 
BIOS : SKLSE2R1.86C.B066.R00.1412251115
ME FW : 11.0.0.1100
Ksc (EC FW): 1.00
Mesa: mesa-10.6.3 ddc976368fef367e464472ebcc2ac4fd89eb9fd8 from http://cgit.freedesktop.org/mesa/mesa/
Xf86_video_intel: 2.99.917 baec802b21387d04aebb10ac29e719a1800c5aa0 from http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/
Libdrm: libdrm-2.4.62 ba4b5ac010ab85406ec52e3906e13d58cd9aa782 from http://cgit.freedesktop.org/mesa/drm/
Cairo: 1.14.2 from 93422b3cb5e0ef8104b8194c8873124ce2f5ea2d http://cgit.freedesktop.org/cairo
libva: libva-1.6.0 a8008998bc0d4a76ae6927607c048e52ba50fd0e from http://cgit.freedesktop.org/libva/
intel-driver: 1.6.0 32268c46d538667d437dc9266aa4c183e51c1286 from http://cgit.freedesktop.org/vaapi/intel-driver
Xserver: xorg-server-1.17.2 2123f7682d522619f101b05fb75efa75dabbe371 from http://cgit.freedesktop.org/xorg/xserver 

Kernel commit log:
commit cbad75dd043b82a0684e94fe3b5f994fe0c3414b
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Aug 11 12:08:05 2015 +0200
drm-intel-nightly: 2015y-08m-11d-10h-07m-28s UTC integration manifest

igt 1.11-5c07135b

Comment 32 cprigent 2015-08-14 09:53:18 UTC

Tested again on SKL-S after a full power cycle (last time I first reproduced the GPU hang with last version of igt, reset was not enough).
Test worked several times then GPU hang is reproduced.

Comment 33 cprigent 2015-08-14 10:04:59 UTC

Tested again on SKL-S after a full power cycle (last time I first reproduced the GPU hang with last version of igt, reset was not enough).
Test works several times (from 4 to 10 iterations) then GPU hang is reproduced.

Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [  235.831098] [drm] stuck on render ring
Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [  235.831101] [drm] stuck on bsd ring
Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [  235.831102] [drm] stuck on blitter ring
Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [  235.831104] [drm] stuck on video enhancement ring
Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [  235.831601] [drm] GPU HANG: ecode 9:0:0x85d7fffb, in gem_concurrent_ [4417], reason: Ring hung, action: reset
Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [  235.831666] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [  235.831667] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [  235.831669] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [  235.831670] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [  235.831671] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Aug 14 04:44:19 qa-Skylake-Client-platform kernel: [  236.536375] [drm:gen8_do_reset [i915]] *ERROR* bsd ring: reset request timeout
Aug 14 04:44:19 qa-Skylake-Client-platform kernel: [  236.538521] drm/i915: Resetting chip after gpu hang
Aug 14 04:44:19 qa-Skylake-Client-platform kernel: [  236.538535] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5

Comment 34 cprigent 2015-08-14 16:21:58 UTC

Test executed successfully 100 times on SKL-U and SKL-Y.

SKL-U
Platform: SKY LAKE U A0 
CPU : Intel(R) CPU @ 1.6GHz (family:6, model: 78 stepping: 2) 
MCP : SKL-Y C1 2+3 ( ULX-C1)
QDF : QJ57
CPU : SKL C1
Chipset PCH: Sunrise Point LP B1 
CRB : SKY LAKE U A0 DDR3L RVP7 FAB1
Reworks : All Mandatories
Software
BIOS : SKLSE2R1.R00.X090.B01.1506290657
ME FW : 11.0.0.1158
Ksc (EC FW): 1.15

SKL-Y
CPU : Intel(R) Core(TM) m3-6Y30 CPU @ 0.8GHz 4MB (family: 6, model: 78 stepping: 3) MCP : SKL-Y D1 2+2 QDF : QVY3 CPU : SKL D1
Chipset PCH: Sunrise Point LP C1 
CRB : SKY LAKE Y LPDDR3 RVP3 CRB FAB2
Reworks : All Mandatories + FBS02 & FBS03, O-06
Software
BIOS : SKLSE2R1.R00.X093.B02.1507222151
ME FW : 11.0.0.1157
Ksc (EC FW): 1.15

Same kernel, GFX stack and IGT versions.

Comment 35 dog 2015-08-19 00:35:00 UTC

If QA has tested for this bug and has not been able to reproduce it, is it time to close the bug as fixed?  If not, what else needs to be done?

Comment 36 cprigent 2015-08-19 08:35:52 UTC

Bug scrub:
Let's QA do a final verification
- with 500 iterations like in the Jira bug
- with including the fix for the last regression (driver regression identified debugging for the original problem)
- with last CPU on SKL-U
- also on SKL-H

Comment 37 Gavin Hindman 2015-08-20 17:45:35 UTC

What was the verification result?

Comment 38 cprigent 2015-08-21 17:21:57 UTC

- SKL-U does not boot with new CPU (we asked support)
Not reproduced with old CPU, fresh setup and 500 iterations
- SKL-H now boots but with power supply problem (we asked support), we will report for this platform if needed
- SKL-Y: not reproduced with fresh setup and 500 iterations
- SKL-S: no HW upgrade available, not reproduced according to Mika Kuoppala.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.