Summary: | [SKL]igt/gem_concurrent_blit subcase sporadically causes system hang and *ERROR* blitter: timed out waiting for forcewake ack to clear. | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | lu hua <huax.lu> | ||||||||||||||||
Component: | DRM/Intel | Assignee: | Joonas Lahtinen <joonas.lahtinen> | ||||||||||||||||
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||||||||||||
Severity: | critical | ||||||||||||||||||
Priority: | highest | CC: | christophe.prigent, eero.t.tamminen, gary.c.wang, gavin.hindman, intel-gfx-bugs, james.ausmus, joe.konno, patrik.r.jakobsson, tjaalton, ypwong | ||||||||||||||||
Version: | unspecified | ||||||||||||||||||
Hardware: | All | ||||||||||||||||||
OS: | Linux (All) | ||||||||||||||||||
Whiteboard: | |||||||||||||||||||
i915 platform: | BXT, SKL | i915 features: | GEM/Other, GPU hang | ||||||||||||||||
Attachments: |
|
Description
lu hua
2015-04-09 02:17:13 UTC
The dmesg output is truncated. Can you produce another one with the logs leading to the issue? (In reply to Ander Conselvan de Oliveira from comment #1) > The dmesg output is truncated. Can you produce another one with the logs > leading to the issue? I add drm.debug=0xe, console=tty0 and console=ttyS0,9600, get the dmesg. I retest it with these 3 parameters, get similar dmesg. Do you have any parameter can print dmesg as you expect? (In reply to lu hua from comment #2) > (In reply to Ander Conselvan de Oliveira from comment #1) > > The dmesg output is truncated. Can you produce another one with the logs > > leading to the issue? > > I add drm.debug=0xe, console=tty0 and console=ttyS0,9600, get the dmesg. I > retest it with these 3 parameters, get similar dmesg. > Do you have any parameter can print dmesg as you expect? Add loglevel=7 to the kernel command line. Created attachment 115618 [details]
dmesg(netconsole)
update dmesg.
Hello Ander, Would you pls help take a look at the dmesg captured from netconsole? Thanks. Can you reproduce similar problem with forcewake with enough runs of gem_render_copy and gem_render_copy_redux? Try: 'while ./gem_render_copy ; do date ; done' (In reply to Mika Kuoppala from comment #6) > Can you reproduce similar problem with forcewake with enough runs of > gem_render_copy and gem_render_copy_redux? > > Try: 'while ./gem_render_copy ; do date ; done' I can not reproduce similar problem. (In reply to ye.tian from comment #7) > (In reply to Mika Kuoppala from comment #6) > > Can you reproduce similar problem with forcewake with enough runs of > > gem_render_copy and gem_render_copy_redux? > > > > Try: 'while ./gem_render_copy ; do date ; done' > > I can not reproduce similar problem. There is evidence that the forcewake ack clearing error is a symptom of accessing the blitter ring post reset. That indicates that our reinitialization after reset might be incomplete on skl. If 'cat /sys/kernel/debug/dri/0/i915_wedged' is not zero when forcewake ack errors start to mount, that would support the incomplete initialization hypothesis. Test on the latest drm-intel-nightly kernel,It still exists. Any update on this issue? (In reply to Gavin Hindman from comment #10) > Any update on this issue? Update: The problem is that after a GPU hang caused by this test, the gpu ends up in a state that writing to a reset register is an immediate system hang. No root cause nor workaround yet. This was earlier masked by an another bug (possibly e1000e memory corrupt one) bufmgr would initialize properly only at random occasions, so the testcase could not be run succesfully at all. Now that the bufmgr initializes properly so able to proceed. The gem_render_copy test runs for several thousand iterations without crashing. I'm getting hangs with gem_concurrent_blit now, but they're different than the earlier reported ones; [ 681.882688] gem_concurrent_blit: starting subtest gtt-render-overwrite-source-forked [ 682.869957] [drm] RC6 on [ 686.869348] [drm] stuck on render ring [ 686.873886] [drm] GPU HANG: ecode 9:0:0x85dffffb, in gem_concurrent_ [12313], reason: Ring hung, action: reset [ 686.887309] drm/i915: Resetting chip after gpu hang [ 686.904592] gem_concurrent_blit: starting subtest gtt-render-overwrite-source-read-bcs-forked The error reports do not seem very coherent with single bug, I continue looking at the problem I get. Created attachment 116532 [details] [review] drm/i915: Reset request handling for gen9+ (In reply to Mika Kuoppala from comment #13) > Created attachment 116532 [details] [review] [review] > drm/i915: Reset request handling for gen9+ Test will hang the gpu and fail but this patch should prevent the system hang. Test result is the same as [Comment 14]. Where are we on this issue? I was able to reproduce this on a rev04 SKL-H by running xonotic demo on a loop. kernel ~4.2rc1, mesa 10.6.1 Created attachment 117175 [details]
error state with gem_concurrent_blit
Fast way to reproduce is: gem_concurrent_blit --run-subtest prw-blt-overwrite-source-read-rcs-forked The forcewake ack to clear error messages are misleading as they result from execlist interrupt handler accessing already dead gpu. These items are _less_ likely to be the root cause as I have disabled them selectively without impact on occurrence. - rc6 - dynamic page tables (reproducible with static pdps and aliasing ppgtt) - execlist scheduling/pre-emption (reproducible with 1 slot entry submission) - per ctx bb / pre-emption disabling (Arun's patches are already in) So in the probable list remains: - flush problem (tlb). Forced flushing per ring/bb didn't have impact. - context save/restore corruption. - missing workaround Created attachment 117569 [details] log test Test passed with success. IGT-Version: 1.11-g6bd42ce (x86_64) (Linux: 4.2.0-rc5-drm-intel-nightly-ww32-patch-bug89739+ x86_64) using 2x512 buffers, each 1MiB Subtest prw-blt-overwrite-source-read-rcs-forked: SUCCESS (8.634s) Hardware Platform: SKY LAKE Y A0 CPU : Intel(R) Core(TM) m3-6Y30 CPU @ 0.8GHz 4MB (family: 6, model: 78 stepping: 3) MCP : SKL-Y D1 2+2 QDF : QVY3 CPU : SKL D1 Chipset PCH: Sunrise Point LP C1 CRB : SKY LAKE Y LPDDR3 RVP3 CRB FAB2 Reworks : All Mandatories + FBS02 & FBS03, O-06 Software Kernel : drm-intel-nightly fb4572c00fadc1ac94816061e76c65b65607f66a 4.2.0-rc5 from git://anongit.freedesktop.org/drm-intel Bios: SKLSE2R1.R00.X093.1507222151 ME FW : 11.0.0.1165 Ksc (EC FW): 1.16 drm: (HEAD, origin/master, origin/HEAD, master) fc083322b0c8a58b51976adf23a582bce8bb75f1 from git://git.freedesktop.org/git/mesa/drm intel-driver: (HEAD, origin/master, origin/HEAD, master) 611d8ea9d75dc026c203e3ebe53b434769d4587c from git://git.freedesktop.org/git/vaapi/intel-driver libva: (HEAD, origin/master, origin/HEAD, master) 70b80c0dd2effb4956b208775641f7c68a67a9df from git://git.freedesktop.org/git/vaapi/libva mesa: (HEAD, origin/master, origin/HEAD, master) 1b2b0e42ce47bfd1fcb5513ed2c23b9bb7a5a5b8 from git://git.freedesktop.org/git/mesa/mesa xf86-video-intel: (HEAD, origin/master, origin/HEAD, master) 4246c63347290390a2104739c719f5ff6a05a0e2 from git://git.freedesktop.org/git/xorg/driver/xf86-video-intel xserver: (HEAD, origin/master, origin/HEAD, master) ea03e314f98e5d8ed7bf7a508006a3d84014bde5 from git://git.freedesktop.org/git/xorg/xserver Kernel commit log: commit fb4572c00fadc1ac94816061e76c65b65607f66a Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Aug 5 17:33:52 2015 +0200 drm-intel-nightly: 2015y-08m-05d-15h-33m-02s UTC integration manifest This has been resolved for SKL-Y with the commit; http://cgit.freedesktop.org/drm-intel/commit/?id=245d96670d2655f70f4445d5247f26afbe705c84 We're still working intensively for resolving it for other SKL platforms. Adding GPU hang tag, as this error message is presumably caused by the hanging GPU, and is seen on BXT too. Created attachment 117628 [details] [review] drm/i915: Contain the WA_REG macro Created attachment 117629 [details] [review] drm/i915/gen9: Disable gather at set shader bit Please test with both drm/i915: Contain the WA_REG macro drm/i915/gen9: Disable gather at set shader bit patches applied. the new patches work fine on my SKL-Y SDP Note that the customer who is impacted by this bug is using platforms SKL-S and SKL-H. Does anyone have these platforms so the latest patches can be tested? Also, the bug status is NEEDINFO. What additional info is needed at this point? I do, and testing the various patches on nightly to see what changes We're awaiting for QA to do a test run on all variants, but are quite convinced that this specific bug is resolved. The WebGL hang is most likely an entirely different bug, and should be reported separately with reproduction instructions. Fixing the status to assigned, we're just waiting for confirmation that everyone is having these gem_concurrent_blit subtests working. commit dd82494724c1c11ceeeaac66a2ed0113ec13f8e4 Author: Arun Siluvery <arun.siluvery@linux.intel.com> Date: Wed Aug 12 12:26:01 2015 +0100 lib/rendercopy_gen9: Setup Push constant pointer before sending BTP commands Please retest gem_concurrent_blit with latest drm-intel-nightly on igt/master on skl variants you have. Tested on SKL-S: test seems to never end. CPU usage is 100%. We are able to end the test with ctrl-z, no crash. Test gem_render_copy executed just after is pass. SKY LAKE-S1 (QA Lab) Hardware Platform: SKY LAKE DT Pre Alpha CPU : Intel(R) CPU @ 2.3GHz (family: 6, model: 94 stepping: 0) SoC : SKL-S (DT) 4+2 P0 Chipset PCH: Sunrise Point A0 CRB : SKY LAKE DT DDR4 RVP08 CRB FAB1 Software Linux : Ubuntu 14.04 LTS 64 bits Kernel : drm-intel-nightly cbad75dd043b82a0684e94fe3b5f994fe0c3414b (4.2.0-rc5) from git://anongit.freedesktop.org/drm-intel BIOS : SKLSE2R1.86C.B066.R00.1412251115 ME FW : 11.0.0.1100 Ksc (EC FW): 1.00 Mesa: mesa-10.6.3 ddc976368fef367e464472ebcc2ac4fd89eb9fd8 from http://cgit.freedesktop.org/mesa/mesa/ Xf86_video_intel: 2.99.917 baec802b21387d04aebb10ac29e719a1800c5aa0 from http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/ Libdrm: libdrm-2.4.62 ba4b5ac010ab85406ec52e3906e13d58cd9aa782 from http://cgit.freedesktop.org/mesa/drm/ Cairo: 1.14.2 from 93422b3cb5e0ef8104b8194c8873124ce2f5ea2d http://cgit.freedesktop.org/cairo libva: libva-1.6.0 a8008998bc0d4a76ae6927607c048e52ba50fd0e from http://cgit.freedesktop.org/libva/ intel-driver: 1.6.0 32268c46d538667d437dc9266aa4c183e51c1286 from http://cgit.freedesktop.org/vaapi/intel-driver Xserver: xorg-server-1.17.2 2123f7682d522619f101b05fb75efa75dabbe371 from http://cgit.freedesktop.org/xorg/xserver Kernel commit log: commit cbad75dd043b82a0684e94fe3b5f994fe0c3414b Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Tue Aug 11 12:08:05 2015 +0200 drm-intel-nightly: 2015y-08m-11d-10h-07m-28s UTC integration manifest igt 1.11-5c07135b Tested again on SKL-S after a full power cycle (last time I first reproduced the GPU hang with last version of igt, reset was not enough). Test worked several times then GPU hang is reproduced. Tested again on SKL-S after a full power cycle (last time I first reproduced the GPU hang with last version of igt, reset was not enough). Test works several times (from 4 to 10 iterations) then GPU hang is reproduced. Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [ 235.831098] [drm] stuck on render ring Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [ 235.831101] [drm] stuck on bsd ring Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [ 235.831102] [drm] stuck on blitter ring Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [ 235.831104] [drm] stuck on video enhancement ring Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [ 235.831601] [drm] GPU HANG: ecode 9:0:0x85d7fffb, in gem_concurrent_ [4417], reason: Ring hung, action: reset Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [ 235.831666] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace. Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [ 235.831667] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [ 235.831669] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue. Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [ 235.831670] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it. Aug 14 04:44:18 qa-Skylake-Client-platform kernel: [ 235.831671] [drm] GPU crash dump saved to /sys/class/drm/card0/error Aug 14 04:44:19 qa-Skylake-Client-platform kernel: [ 236.536375] [drm:gen8_do_reset [i915]] *ERROR* bsd ring: reset request timeout Aug 14 04:44:19 qa-Skylake-Client-platform kernel: [ 236.538521] drm/i915: Resetting chip after gpu hang Aug 14 04:44:19 qa-Skylake-Client-platform kernel: [ 236.538535] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5 Test executed successfully 100 times on SKL-U and SKL-Y. SKL-U Platform: SKY LAKE U A0 CPU : Intel(R) CPU @ 1.6GHz (family:6, model: 78 stepping: 2) MCP : SKL-Y C1 2+3 ( ULX-C1) QDF : QJ57 CPU : SKL C1 Chipset PCH: Sunrise Point LP B1 CRB : SKY LAKE U A0 DDR3L RVP7 FAB1 Reworks : All Mandatories Software BIOS : SKLSE2R1.R00.X090.B01.1506290657 ME FW : 11.0.0.1158 Ksc (EC FW): 1.15 SKL-Y CPU : Intel(R) Core(TM) m3-6Y30 CPU @ 0.8GHz 4MB (family: 6, model: 78 stepping: 3) MCP : SKL-Y D1 2+2 QDF : QVY3 CPU : SKL D1 Chipset PCH: Sunrise Point LP C1 CRB : SKY LAKE Y LPDDR3 RVP3 CRB FAB2 Reworks : All Mandatories + FBS02 & FBS03, O-06 Software BIOS : SKLSE2R1.R00.X093.B02.1507222151 ME FW : 11.0.0.1157 Ksc (EC FW): 1.15 Same kernel, GFX stack and IGT versions. If QA has tested for this bug and has not been able to reproduce it, is it time to close the bug as fixed? If not, what else needs to be done? Bug scrub: Let's QA do a final verification - with 500 iterations like in the Jira bug - with including the fix for the last regression (driver regression identified debugging for the original problem) - with last CPU on SKL-U - also on SKL-H What was the verification result? - SKL-U does not boot with new CPU (we asked support) Not reproduced with old CPU, fresh setup and 500 iterations - SKL-H now boots but with power supply problem (we asked support), we will report for this platform if needed - SKL-Y: not reproduced with fresh setup and 500 iterations - SKL-S: no HW upgrade available, not reproduced according to Mika Kuoppala. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.