Bug 96743 - [BYT, HSW, SKL, BXT, KBL] GPU hangs with GfxBench 4.0 CarChase
Summary: [BYT, HSW, SKL, BXT, KBL] GPU hangs with GfxBench 4.0 CarChase
Status: NEW
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: x86-64 (AMD64) All
Importance: high normal
Assignee: Topi Pohjolainen
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on: 96291
Blocks:
Reported: 2016-06-30 08:44 UTC by Eero Tamminen
Modified: 2017-05-17 22:52 UTC (History)
CC List: 5 users

See Also:
i915 platform:
i915 features:


Attachments
CarChase screenshot with rendering issue (770.17 KB, image/png)
2016-07-12 12:31 UTC, Eero Tamminen
How it should look like (761.06 KB, image/png)
2016-07-12 12:31 UTC, Eero Tamminen
i915 error info on GPU hang (54.69 KB, text/plain)
2016-12-19 16:27 UTC, Eero Tamminen
dmesg of the hang (47.05 KB, text/plain)
2016-12-19 16:27 UTC, Eero Tamminen
Where X hangs later on (which then causes tests to hang) (654 bytes, text/plain)
2016-12-19 16:28 UTC, Eero Tamminen
Error dump on HSW with nir opts disabled on top of upstream (135.71 KB, text/plain)
2017-03-29 04:31 UTC, Topi Pohjolainen

Description Eero Tamminen 2016-06-30 08:44:42 UTC
Setup:
- BYT N2820
- Ubuntu 16.04
- GfxBench v4: https://gfxbench.com/linux-download/
- Mesa from a few days ago
- drm-nightly kernel from the same day as Mesa

Use-case:
- Start GfxBench
- Run the Car Chase test a few times

Expected outcome:
- No hangs

Actual outcome:
- On some runs, multiple recovered GPU hangs in dmesg
------------
[drm] GPU HANG: ecode 7:0:0x87f77c3e, in testfw_app [10406], reason: Engine(s) hung, action: reset
[drm] GPU HANG: ecode 7:0:0x8ff8ffff, in testfw_app [10406], reason: Engine(s) hung, action: reset
[drm] GPU HANG: ecode 7:0:0x87f7edfe, in testfw_app [10406], reason: Engine(s) hung, action: reset
[drm] GPU HANG: ecode 7:0:0x85fffffc, in testfw_app [10406], reason: Engine(s) hung, action: reset
------------
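
For going through automated run logs, the hang lines above can be picked apart with a regular expression. This is a minimal sketch that assumes the message keeps the exact layout quoted above; the field names "gen" and "engine" for the first two ecode numbers are assumed labels, not something stated in this report.

```python
import re

# Pattern modelled on the "[drm] GPU HANG" lines quoted in this report.
# Other kernel versions may format the message differently.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.*?), action: (?P<action>\w+)"
)


def parse_hangs(dmesg_text):
    """Return one dict per GPU HANG line found in dmesg_text."""
    hangs = []
    for match in HANG_RE.finditer(dmesg_text):
        hang = match.groupdict()
        hang["pid"] = int(hang["pid"])
        hangs.append(hang)
    return hangs


SAMPLE = """\
[drm] GPU HANG: ecode 7:0:0x87f77c3e, in testfw_app [10406], reason: Engine(s) hung, action: reset
[drm] GPU HANG: ecode 7:0:0x8ff8ffff, in testfw_app [10406], reason: Engine(s) hung, action: reset
"""

hangs = parse_hangs(SAMPLE)
print(len(hangs), [h["ecode"] for h in hangs])
```

Counting the parsed entries per process and per run is then enough to tell "hung at least once" from "clean run" in the nightly logs.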

Unlike with bug 96291, this doesn't necessarily happen on every run.

I've seen GPU hangs with this test also on HSW GT3e and BXT-P, but there they're rarer.  As they don't happen every time, I'm not sure whether this is a Mesa issue (I haven't got a test machine myself, these are from automated run logs).
Comment 1 Eero Tamminen 2016-07-05 11:21:18 UTC
I've now seen hangs in this test also on SKL.  I haven't seen them on BDW or BSW.

On BXT-P the hangs happen on about every run, on BYT maybe on one run out of three, and on HSW GT3e (brixbox) & fast SKL GT2 quite rarely (1 run in 10?).

Haven't seen issues with CSDof or CSCloth, so it might not be related to compute shaders like bug 96291 was.
Comment 2 Eero Tamminen 2016-07-12 12:31:14 UTC
Created attachment 125024
CarChase screenshot with rendering issue

Actually, CarChase still has a rendering issue, which may be related to the GPU hangs.  The leaves of the palms etc. don't have correct colors and sometimes flicker.  In the beginning the colors are like in the attached screenshot, later in the demo they have purple/blue edges.
Comment 3 Eero Tamminen 2016-07-12 12:31:59 UTC
Created attachment 125025
How it should look like

These screenshots are from SKL GT2.
Comment 4 Eero Tamminen 2016-08-23 07:50:36 UTC
Hangs are still happening.

Ian, if you are not looking at this, could you find somebody who's going to look at it?
Comment 5 Eero Tamminen 2016-12-07 15:31:43 UTC
CarChase rendering now looks OK.

(Hangs still happen, on almost every run with the offscreen version on SKL GT2 and with both versions on HSW GT3e.)
Comment 6 Eero Tamminen 2016-12-19 16:27:11 UTC
Created attachment 128555
i915 error info on GPU hang
Comment 7 Eero Tamminen 2016-12-19 16:27:34 UTC
Created attachment 128556
dmesg of the hang
Comment 8 Eero Tamminen 2016-12-19 16:28:08 UTC
Created attachment 128557
Where X hangs later on (which then causes tests to hang)
Comment 9 Eero Tamminen 2017-01-10 09:05:46 UTC
GPU hangs still happen regularly with the offscreen version on SKL GT2 & BXT.  It's ~3 weeks since I've seen one on HSW GT3e (where both onscreen & offscreen versions hung regularly earlier).
Comment 10 Ben Widawsky 2017-02-14 04:56:03 UTC
Eero, I'm assuming you're still hitting this. Can you verify the error state is the same?
Comment 11 Eero Tamminen 2017-02-16 14:28:37 UTC
(In reply to Ben Widawsky from comment #10)
> Eero, I'm assuming you're still hitting this.

Yes, every day on SKL GT2, and at least every other day on BXT.

> Can you verify the error state is the same?

I'll send you links to a few of the latest ones for SKL & BXT.
Comment 12 Eero Tamminen 2017-02-16 16:38:43 UTC
Did a clean boot with a few days old 3D stack on BXT, and ran CarChase offscreen 20 times to see whether this test alone is still able to trigger a GPU hang.  -> Performance was OK for all the rounds, but there was a GPU hang from the test.

I.e. this test alone is sufficient to trigger the hang; it doesn't require running anything else, even on BXT.

(Topi's using the GT2, so I can't test that right now, but hangs should be even easier to trigger on that.)

It's possible that getting a perf drop from the hang would also require running other test-cases, as I didn't see one.
Comment 13 Topi Pohjolainen 2017-03-08 13:16:24 UTC
On my SKL gt2 I seem to get a hang on every run. I tried disabling HIZ and I get roughly one hang in every three runs. Disabling lossless compression seems to make the hangs go away: using INTEL_DEBUG=norbc I got five clean runs in a row.
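
How much weight a streak of clean runs carries depends on the hang rate it is compared against. A minimal sketch of that arithmetic, assuming each run hangs independently with a fixed probability (an assumption that may not hold if hangs depend on machine state); the function names are made up for illustration:

```python
def prob_all_clean(hang_rate, runs):
    """Probability that `runs` independent runs all finish without a hang."""
    return (1.0 - hang_rate) ** runs


def runs_needed(hang_rate, max_false_clean=0.05):
    """Smallest run count whose all-clean probability is <= max_false_clean."""
    if not 0.0 < hang_rate <= 1.0:
        raise ValueError("hang_rate must be in (0, 1]")
    runs = 1
    while prob_all_clean(hang_rate, runs) > max_false_clean:
        runs += 1
    return runs


# Chance of five clean runs in a row if one run in three still hung.
print(round(prob_all_clean(1 / 3, 5), 3))  # 0.132

# Clean runs needed before a one-in-three hang rate becomes unlikely (< 5%).
print(runs_needed(1 / 3))  # 8
```

Against a hang on every run, five clean runs are strong evidence of a change; against a one-in-three rate they would still occur by chance about 13% of the time.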
Comment 14 Topi Pohjolainen 2017-03-09 12:30:32 UTC
After some narrowing down I hacked the driver to use CCS_E only for three surfaces of size 1920x1080. None of them is mipmapped or arrayed. Two of them are rendered and sampled as a pair. If I disable CCS_E for even one of these, I don't seem to get the GPU hang anymore. But even with these three I get roughly one run out of four without the hang. So I don't have anything conclusive yet. Still trying to narrow this down.
Comment 15 Topi Pohjolainen 2017-03-13 07:46:49 UTC
I still haven't reached the root cause, but having tried various things I thought it better to list some possibly interesting bits:

1) Making clears a no-op gives me a hang in L3 cache reconfig (see gen7_l3_state.c::setup_l3_config()). Specifically in the register write: MI_LOAD_REGISTER_IMM.

2) Making the L3 cache reconfig a no-op in turn moves the hang into the compute pipeline (CarChase also renders some bits using GPGPU). There it hits the second flush issued by intel_fbo.c::brw_render_cache_set_check_flush().

3) Here I started reading the specs again and found a bit we are currently ignoring in a few places. SKL PRM, Volume 7, Flush Types: Tex invalidate: Requires stall bit ([20] of DW) set for all GPGPU Workloads. I added logic that considers the current renderer (brw->last_pipeline == BRW_COMPUTE_PIPELINE) and additionally sets the CS stall bit for compute (for the L3 and render cache flush).

4) But all this simply gave me the hang again in brw_render_cache_set_check_flush(), now in the 3D pipeline.
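
The rule from step 3 can be modelled outside the driver as a small flag fixup. This is only an illustrative sketch, not the actual patch: bit 20 for the CS stall comes from the PRM text quoted above, while the texture cache invalidate bit position, the pipeline names and the function name are assumptions made for the example.

```python
# Bit 20 is the stall bit quoted from the PRM in step 3.
PIPE_CONTROL_CS_STALL = 1 << 20
# Assumed bit position, used here only to have a distinct flag to test.
PIPE_CONTROL_TEXTURE_CACHE_INVALIDATE = 1 << 10

# Hypothetical stand-ins for the driver's pipeline enum.
RENDER_PIPELINE = "render"
COMPUTE_PIPELINE = "compute"


def fixup_flush_flags(flags, last_pipeline):
    """Add the CS stall to a texture invalidate issued after GPGPU work."""
    wants_tex_invalidate = bool(flags & PIPE_CONTROL_TEXTURE_CACHE_INVALIDATE)
    if last_pipeline == COMPUTE_PIPELINE and wants_tex_invalidate:
        flags |= PIPE_CONTROL_CS_STALL
    return flags


flags = fixup_flush_flags(PIPE_CONTROL_TEXTURE_CACHE_INVALIDATE, COMPUTE_PIPELINE)
print(hex(flags))
```

Flushes without a texture invalidate, and texture invalidates issued after 3D work, pass through unchanged in this model.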
Comment 16 Topi Pohjolainen 2017-03-13 07:52:38 UTC
Forgot to mention that brw_render_cache_set_check_flush() likes to flush depth caches among other things. I made that conditional on the render type as well, omitting it for compute.
Comment 17 Topi Pohjolainen 2017-03-16 08:28:58 UTC
We are missing a full stall on SKL when switching from 3D to GPGPU (and vice versa).  The patch is on the list. I'll take a look at HSW (for which the patch has no effect) now that I'm more familiar with the context.
Comment 18 Topi Pohjolainen 2017-03-16 12:53:44 UTC
I tried hsw-gt3e quite a few times with both the offscreen and onscreen versions but couldn't get it to hang with the latest nightly build. Can you try again?
Comment 19 Eero Tamminen 2017-03-16 13:09:42 UTC
As mentioned earlier in the bug, hangs have lately (since about mid December) been happening only on GEN9.

They happen on approximately every 3rd run on BXT & SKL GT2 with normal kernels, and with GuC/SLPC kernels also on eDRAM versions.  Hangs with and without GuC+SLPC seem to happen in the same place, so which specific GEN9 HW the hang happens on is likely to be just a timing issue.
Comment 20 Topi Pohjolainen 2017-03-20 15:11:20 UTC
Introduction of loop unrolling in nir (commit: 715f0d06d19e7c33d98f99c764c5c3249d13b1c0) seems to hide/fix this on HSW. With unrolling I ran 15 times in a row without a hang. Disabling the unrolling gave me a hang in the 2nd round. Disabling the unroll in current master doesn't seem to bring back the hang though.

I inserted a commit disabling the unroll just after the commit introducing it and am trying to bisect again to see which commit fixes/hides it together with the loop unroll.
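
Because the hang is intermittent, a bisection step can only call a commit clean after several runs without a hang. The search can be sketched as follows, assuming the history goes from "hangs" to "no longer hangs" exactly once; the commit names and the fake runner are invented for the demonstration, and a real runner would launch the benchmark and check dmesg.

```python
def hangs(commit, run_once, attempts):
    """A commit counts as hanging if any of `attempts` runs hangs."""
    return any(run_once(commit) for _ in range(attempts))


def first_clean(commits, run_once, attempts=15):
    """Binary search for the first commit that no longer hangs.

    Assumes commits[0] hangs, commits[-1] is clean, and the history flips
    from hanging to clean exactly once.
    """
    hanging, clean = 0, len(commits) - 1
    while clean - hanging > 1:
        mid = (hanging + clean) // 2
        if hangs(commits[mid], run_once, attempts):
            hanging = mid
        else:
            clean = mid
    return commits[clean]


def make_fake_runner(hanging_commits, period=3):
    """Deterministic stand-in for a flaky test: every `period`-th call hangs."""
    calls = {"n": 0}

    def run_once(commit):
        calls["n"] += 1
        return commit in hanging_commits and calls["n"] % period == 0

    return run_once


commits = [f"c{i}" for i in range(8)]  # hypothetical commit names
runner = make_fake_runner(set(commits[:5]))
print(first_clean(commits, runner))  # c5
```

The number of attempts per commit trades bisection time against the risk of marking a still-hanging commit as clean.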
Comment 21 Topi Pohjolainen 2017-03-25 07:42:13 UTC
Okay, disabling these two optimizations gives the gpu hangs on HSW even on top of current upstream:

diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
index f863085..5d83ce3 100644
--- a/src/intel/compiler/brw_nir.c
+++ b/src/intel/compiler/brw_nir.c
@@ -470,7 +470,6 @@ nir_optimize(nir_shader *nir, const struct brw_compiler *compiler,
    do {
       progress = false;
       OPT_V(nir_lower_vars_to_ssa);
-      OPT(nir_opt_copy_prop_vars);
 
       if (is_scalar) {
          OPT(nir_lower_alu_to_scalar);
@@ -498,9 +497,6 @@ nir_optimize(nir_shader *nir, const struct brw_compiler *compiler,
          OPT(nir_opt_dce);
       }
       OPT(nir_opt_if);
-      if (nir->options->max_unroll_iterations != 0) {
-         OPT(nir_opt_loop_unroll, indirect_mask);
-      }
       OPT(nir_opt_remove_phis);
       OPT(nir_opt_undef);
       OPT_V(nir_lower_doubles, nir_lower_drcp |
Comment 22 Topi Pohjolainen 2017-03-29 04:31:26 UTC
Created attachment 130521
Error dump on HSW with nir opts disabled on top of upstream
Comment 23 Topi Pohjolainen 2017-03-29 05:22:18 UTC
I wanted to double check with Ken and he confirmed what I was reading. Quoting Ken: "that's a unique looking hang...on a 3DSTATE_VERTEX_BUFFERS...with GAFS/TDG/DS/TE/HS/VS/VF busy". Ken gave me good ideas on how to proceed: start looking at the indirect loads, narrow down the shader stage first and then take a look at the shaders. It looks like there is a bug hidden in vec4.
Comment 24 Randy 2017-04-18 01:56:35 UTC
I encounter a similar issue on the BXT/Android platform, where the driver is UFO, not Mesa. Per our experiments, the GPU hang is related to ASTC.
I can easily hit the GPU hang using CarChase; the reproduction rate is 60%. If ASTC is disabled in the test case, the reproduction rate is much lower, less than 5%.

I got GfxBench v4 from https://gfxbench.com/linux-download/, but I always get a timeout error when running start.sh. How can I solve it?

[ERROR]: GLFW error: 65537: The GLFW library is not initialized

[INFO ]: Service instance constructor
[ERROR]: <netman> netmanException level: 0, reason: Timeout: connect timed out: 138.68.105.98:443
[ERROR]: Handshake failed: Timeout
Comment 25 Topi Pohjolainen 2017-04-27 05:19:49 UTC
I never used any script, I just run it directly:

./gfxbench_gl-4.0.0/bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_4
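
For repeated runs, the same command line can be wrapped in a few lines of Python. This is a minimal sketch: the arguments are copied from the command above, INTEL_DEBUG=norbc is the setting mentioned in comment 13, and the helper names are made up. The exit code is not assumed to reflect GPU hangs, so dmesg still has to be checked after each run.

```python
import os
import subprocess

# Command line copied from this comment; adjust the path to the local install.
TESTFW_CMD = [
    "./gfxbench_gl-4.0.0/bin/testfw_app",
    "--gfx", "glfw",
    "--gl_api", "desktop_core",
    "--width", "1920",
    "--height", "1080",
    "--fullscreen", "1",
    "--test_id", "gl_4",
]


def build_env(intel_debug=None, base_env=None):
    """Copy the environment, optionally setting INTEL_DEBUG (e.g. "norbc")."""
    env = dict(os.environ if base_env is None else base_env)
    if intel_debug:
        env["INTEL_DEBUG"] = intel_debug
    return env


def run_carchase(rounds, intel_debug=None):
    """Run the benchmark `rounds` times and return the process exit codes."""
    env = build_env(intel_debug)
    return [subprocess.run(TESTFW_CMD, env=env).returncode for _ in range(rounds)]
```

Calling run_carchase(5, intel_debug="norbc") would correspond to five rounds with lossless compression disabled.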
Comment 26 Topi Pohjolainen 2017-04-27 05:25:02 UTC
Just to make certain we are all on the same page where we are at the moment.

1) Originally the bug was filed against HSW. That got "fixed" by optimizations hiding the real bug. Presumably this is in vec4 and indirect loads. Unfortunately I haven't had time to look into it further.

2) Skylake had hangs due to missing stalls between 3D <-> GPGPU switches. This has been fixed upstream for a while now.

3) Eero: I'm getting the impression that there are still hangs with SKL?
Comment 27 Eero Tamminen 2017-04-27 08:08:19 UTC
Hangs had earlier been happening daily on SKL GT2 & BXT, on more than every other run of the benchmark, but:
* since 2017-03-20, I haven't seen any CarChase Offscreen hangs in SKL GT2 dmesgs (recently, there are hangs from "rcs0", whatever that is)
* BXT still hangs, but since about 2017-03-13 the hangs are rarer, at most 1 out of 3 runs hangs

These are Mesa changes, not kernel ones.
Comment 28 Eero Tamminen 2017-04-27 08:25:06 UTC
GLK A1 doesn't have any hangs either, unless one uses a couple of months old Mesa & kernel.  I.e. whatever is still causing the issue for BXT no longer affects GLK.
Comment 29 Mark Janes 2017-05-04 23:24:05 UTC
I can confirm that gpu hangs are easy to reproduce with car chase on BXT.
Comment 30 Topi Pohjolainen 2017-05-16 12:56:47 UTC
Using lab-bxt-antec, I ran:

bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_4_off

some 10+ times but couldn't see any hangs. I used latest Mesa upstream: 0ca5bdb330d6b928c1320e5829906b195bd2c4b8.

Any tips for reproducing?
Comment 31 Topi Pohjolainen 2017-05-16 13:22:03 UTC
Onscreen doesn't seem to hang either.
Comment 32 Eero Tamminen 2017-05-16 15:30:12 UTC
Onscreen hasn't hung on any machine for a long time, only offscreen.

Offscreen was still hanging at least on Saturday:
GPU HANG: ecode 9:0:0x85dfffff, in testfw_app [2308], reason: Hang on rcs0, action: reset
...
rcs0 (submitted by testfw_app [2308], ctx 1 [7], score 0) --- gtt_offset = 0x00000000 feba9000

but I didn't see hangs yesterday.
Comment 33 Mark Janes 2017-05-17 04:48:27 UTC
Topi: which kernel are you using?
Comment 34 Topi Pohjolainen 2017-05-17 05:58:37 UTC
It was using yesterday's CI nightly: 4.12.0-rc1. HEAD looks to be:
9b25870f9fa4548ec2bb40e42fa28f35db2189e1
Comment 35 Mark Janes 2017-05-17 22:52:37 UTC
Well, now I can't reproduce gpu hangs on BXT.  I tried offscreen as well.

