Bug 96743

Summary: [BYT, HSW, SKL, BXT, KBL] GPU hangs with GfxBench 4.0 CarChase
Product: Mesa Reporter: Eero Tamminen <eero.t.tamminen>
Component: Drivers/DRI/i965    Assignee: Topi Pohjolainen <topi.pohjolainen>
Status: VERIFIED FIXED QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal    
Priority: high CC: annie.j.matheson, ben, conselvan2, randy.xu, samu.kaajas
Version: git   
Hardware: x86-64 (AMD64)   
OS: All   
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=96291
https://bugs.freedesktop.org/show_bug.cgi?id=98104
https://bugs.freedesktop.org/show_bug.cgi?id=100932
https://bugs.freedesktop.org/show_bug.cgi?id=101406
https://bugs.freedesktop.org/show_bug.cgi?id=102286
Whiteboard:
i915 platform: i915 features:
Bug Depends on: 96291    
Bug Blocks:    
Attachments: CarChase screenshot with rendering issue
How it should look
i915 error info on GPU hang
dmesg of the hang
Where X hangs later on (which then causes tests to hang)
Error dump on HSW with nir opts disabled on top of upstream

Description Eero Tamminen 2016-06-30 08:44:42 UTC
Setup:
- BYT N2820
- Ubuntu 16.04
- GfxBench v4: https://gfxbench.com/linux-download/
- Mesa from a few days ago
- drm-nightly kernel from the same day as Mesa

Use-case:
- Start GfxBench
- Run the Car Chase test a few times

Expected outcome:
- No hangs

Actual outcome:
- On some runs, multiple recovered GPU hangs in dmesg
------------
[drm] GPU HANG: ecode 7:0:0x87f77c3e, in testfw_app [10406], reason: Engine(s) hung, action: reset
[drm] GPU HANG: ecode 7:0:0x8ff8ffff, in testfw_app [10406], reason: Engine(s) hung, action: reset
[drm] GPU HANG: ecode 7:0:0x87f7edfe, in testfw_app [10406], reason: Engine(s) hung, action: reset
[drm] GPU HANG: ecode 7:0:0x85fffffc, in testfw_app [10406], reason: Engine(s) hung, action: reset
------------

Unlike with bug 96291, this doesn't necessarily happen on every run.

I've seen GPU hangs with this test also on HSW GT3e and BXT-P, but there they're rarer.  As they don't happen every time, I'm not sure whether this is a Mesa issue (I don't have a test machine myself; these are from automated run logs).
Comment 1 Eero Tamminen 2016-07-05 11:21:18 UTC
I've now seen hangs in this test also on SKL.  I haven't seen them on BDW or BSW.

On BXT-P the hangs happen on about every run, on BYT maybe on one run out of three, and on HSW GT3e (brixbox) & a fast SKL GT2 quite rarely (maybe 1 run in 10).

Haven't seen issues with CSDof or CSCloth, so it might not be related to compute shaders like bug 96291 was.
Comment 2 Eero Tamminen 2016-07-12 12:31:14 UTC
Created attachment 125024 [details]
CarChase screenshot with rendering issue

Actually, CarChase still has some rendering issues, which may be related to the GPU hangs.  The leaves on the palms etc. don't have correct colors and sometimes flicker.  In the beginning the colors are like in the attached screenshot; later in the demo they have purple/blue edges.
Comment 3 Eero Tamminen 2016-07-12 12:31:59 UTC
Created attachment 125025 [details]
How it should look

These screenshots are from SKL GT2.
Comment 4 Eero Tamminen 2016-08-23 07:50:36 UTC
Hangs are still happening.

Ian, if you are not looking at this, could you find somebody who's going to look at it?
Comment 5 Eero Tamminen 2016-12-07 15:31:43 UTC
CarChase rendering looks now OK.

(Hangs still happen, on almost every run with the offscreen version on SKL GT2 and with both versions on HSW GT3e.)
Comment 6 Eero Tamminen 2016-12-19 16:27:11 UTC
Created attachment 128555 [details]
i915 error info on GPU hang
Comment 7 Eero Tamminen 2016-12-19 16:27:34 UTC
Created attachment 128556 [details]
dmesg of the hang
Comment 8 Eero Tamminen 2016-12-19 16:28:08 UTC
Created attachment 128557 [details]
Where X hangs later on (which then causes tests to hang)
Comment 9 Eero Tamminen 2017-01-10 09:05:46 UTC
GPU hangs still happen regularly with the offscreen version on SKL GT2 & BXT.   It's been ~3 weeks since I've seen one on HSW GT3e (where both the onscreen & offscreen versions hung regularly earlier).
Comment 10 Ben Widawsky 2017-02-14 04:56:03 UTC
Eero, I'm assuming you're still hitting this. Can you verify the error state is the same?
Comment 11 Eero Tamminen 2017-02-16 14:28:37 UTC
(In reply to Ben Widawsky from comment #10)
> Eero, I'm assuming you're still hitting this.

Yes, every day on SKL GT2, and at least every other day on BXT.

> Can you verify the error state is the same?

I'll send you links to a few of the latest ones for SKL & BXT.
Comment 12 Eero Tamminen 2017-02-16 16:38:43 UTC
Did a clean boot with a few days old 3D stack on BXT, and ran CarChase offscreen 20 times to see whether this test alone is still able to trigger a GPU hang.  -> Performance was OK for all the rounds, but there was a GPU hang from the test.

I.e. this test alone is sufficient to trigger the hang; it doesn't require running anything else, even on BXT.

(Topi's using the GT2, so I can't test that right now, but hangs should be even easier to trigger on it.)

It's possible that getting a perf drop from the hang would require also running other test cases, as I didn't see one.
Comment 13 Topi Pohjolainen 2017-03-08 13:16:24 UTC
On my SKL gt2 I seem to get a hang on every run. I tried disabling HIZ and I get roughly one hang in every three runs. Disabling lossless compression seems to make the hangs go away: using INTEL_DEBUG=norbc I got five clean runs in a row.
Comment 14 Topi Pohjolainen 2017-03-09 12:30:32 UTC
After some narrowing down, I hacked the driver to use CCS_E only for three surfaces of size 1920x1080. None of them is mipmapped or arrayed. Two of them are rendered and sampled as a pair. If I disable CCS_E for even one of these, I don't seem to get the GPU hang anymore. But even with these three I get roughly one run out of four without the hang, so I don't have anything conclusive yet. Still trying to narrow this down.
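
For reference, a minimal sketch of the kind of narrowing-down hack described above; the struct, field names and helper are illustrative stand-ins rather than the real i965 code or the actual patch used:

/* Simplified model: allow lossless compression (CCS_E) only for 1920x1080,
 * non-mipmapped, non-arrayed surfaces, so individual surfaces can be ruled
 * in or out as hang triggers.
 */
#include <stdbool.h>

struct surface_info {
   unsigned width, height;   /* base level dimensions */
   unsigned levels;          /* number of miplevels */
   unsigned layers;          /* array length */
};

static bool
debug_allow_ccs_e(const struct surface_info *s)
{
   return s->width == 1920 && s->height == 1080 &&
          s->levels == 1 && s->layers == 1;
}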
Comment 15 Topi Pohjolainen 2017-03-13 07:46:49 UTC
I still haven't reached the root cause, but having tried various things I thought I'd better list some possibly interesting bits:

1) Making clears a no-op gives me a hang in the L3 cache reconfig (see gen7_l3_state.c::setup_l3_config()), specifically in the register write MI_LOAD_REGISTER_IMM.

2) Making the L3 cache reconfig a no-op in turn moves the hang into the compute pipeline (CarChase also renders some bits using GPGPU). There it hits the second flush issued by intel_fbo.c::brw_render_cache_set_check_flush().

3) Here I started reading the specs again and found a bit we are currently ignoring in a few places. SKL PRM, Volume 7, Flush Types: "Tex invalidate: Requires stall bit ([20] of DW) set for all GPGPU Workloads." I added logic that checks the current renderer (brw->last_pipeline == BRW_COMPUTE_PIPELINE) and additionally sets the CS stall bit for compute (for the L3 and render cache flush).

4) But all this simply gave me the hang again in brw_render_cache_set_check_flush(), now in the 3D pipeline.
Comment 16 Topi Pohjolainen 2017-03-13 07:52:38 UTC
Forgot to mention that brw_render_cache_set_check_flush() likes to flush depth caches among other things. I made that conditional on the render type as well, omitting it for compute.
Comment 17 Topi Pohjolainen 2017-03-16 08:28:58 UTC
We are missing a full stall on SKL when switching from 3D to GPGPU (and vice versa).  A patch is on the list. I'll take a look at HSW (for which the patch has no effect) now that I'm more familiar with the context.
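
For illustration, a minimal sketch of the idea behind that patch, assuming a simplified context struct and a stand-in PIPE_CONTROL helper; the names and flag values below are placeholders, not the actual upstream change:

#include <stdint.h>

enum pipeline { PIPELINE_3D, PIPELINE_COMPUTE };

/* Stand-in PIPE_CONTROL bits; the real driver has many more. */
#define PIPE_CONTROL_CS_STALL            (1u << 0)
#define PIPE_CONTROL_RENDER_TARGET_FLUSH (1u << 1)
#define PIPE_CONTROL_DEPTH_CACHE_FLUSH   (1u << 2)
#define PIPE_CONTROL_DC_FLUSH            (1u << 3)

struct ctx {
   enum pipeline last_pipeline;
};

/* Stand-in for the driver's PIPE_CONTROL emission helper. */
static void
emit_pipe_control(struct ctx *ctx, uint32_t flags)
{
   (void)ctx; (void)flags;   /* would append a PIPE_CONTROL to the batch */
}

static void
select_pipeline(struct ctx *ctx, enum pipeline pipeline)
{
   if (ctx->last_pipeline == pipeline)
      return;

   /* Flush and fully stall (CS stall) so all outstanding 3D or GPGPU work
    * is drained before the pipeline is switched.
    */
   emit_pipe_control(ctx, PIPE_CONTROL_RENDER_TARGET_FLUSH |
                          PIPE_CONTROL_DEPTH_CACHE_FLUSH |
                          PIPE_CONTROL_DC_FLUSH |
                          PIPE_CONTROL_CS_STALL);

   /* ... emit PIPELINE_SELECT and re-emit pipeline-dependent state ... */
   ctx->last_pipeline = pipeline;
}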
Comment 18 Topi Pohjolainen 2017-03-16 12:53:44 UTC
I tried quite a few times on hsw-gt3e, with both the offscreen and onscreen versions, but couldn't get it to hang with the latest nightly build. Can you try again?
Comment 19 Eero Tamminen 2017-03-16 13:09:42 UTC
As mentioned earlier in the bug, hangs have lately (since about mid December) been happening only on GEN9.

They happen approximately on every 3rd run on BXT & SKL GT2 with normal kernels, and with GuC/SLPC kernels also on the eDRAM variants.  Hangs with and without GuC+SLPC seem to happen in the same place, so which specific GEN9 HW the hang happens on is likely just a timing issue.
Comment 20 Topi Pohjolainen 2017-03-20 15:11:20 UTC
Introduction of loop unrolling in NIR (commit 715f0d06d19e7c33d98f99c764c5c3249d13b1c0) seems to hide/fix this on HSW. With unrolling I ran 15 times in a row without a hang; disabling the unrolling gave me a hang in the 2nd round. Disabling the unroll in current master doesn't seem to bring back the hang, though.

I inserted a commit disabling the unroll just after the commit introducing it and am bisecting again to see which commit fixes/hides it when the loop unroll is in place.
Comment 21 Topi Pohjolainen 2017-03-25 07:42:13 UTC
Okay, disabling these two optimizations gives me the GPU hangs on HSW even on top of current upstream:

diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
index f863085..5d83ce3 100644
--- a/src/intel/compiler/brw_nir.c
+++ b/src/intel/compiler/brw_nir.c
@@ -470,7 +470,6 @@ nir_optimize(nir_shader *nir, const struct brw_compiler *compiler,
    do {
       progress = false;
       OPT_V(nir_lower_vars_to_ssa);
-      OPT(nir_opt_copy_prop_vars);
 
       if (is_scalar) {
          OPT(nir_lower_alu_to_scalar);
@@ -498,9 +497,6 @@ nir_optimize(nir_shader *nir, const struct brw_compiler *compiler,
          OPT(nir_opt_dce);
       }
       OPT(nir_opt_if);
-      if (nir->options->max_unroll_iterations != 0) {
-         OPT(nir_opt_loop_unroll, indirect_mask);
-      }
       OPT(nir_opt_remove_phis);
       OPT(nir_opt_undef);
       OPT_V(nir_lower_doubles, nir_lower_drcp |
Comment 22 Topi Pohjolainen 2017-03-29 04:31:26 UTC
Created attachment 130521 [details]
Error dump on HSW with nir opts disabled on top of upstream
Comment 23 Topi Pohjolainen 2017-03-29 05:22:18 UTC
I wanted to double-check with Ken and he confirmed what I was reading. Quoting Ken: "that's a unique looking hang...on a 3DSTATE_VERTEX_BUFFERS...with GAFS/TDG/DS/TE/HS/VS/VF busy". Ken gave me good ideas on how to proceed: start looking at the indirect loads, narrow down the shader stage first and then take a look at the shaders. It looks like there is a bug hidden in vec4.
Comment 24 Randy 2017-04-18 01:56:35 UTC
I encountered a similar issue on the BXT/Android platform, where the driver is UFO, not Mesa. Per our experiments, the GPU hang is related to ASTC.
I can easily hit the GPU hang using CarChase; the reproduction rate is 60%. If I disable ASTC, the reproduction rate is much lower, less than 5%.

I got GfxBench v4 from https://gfxbench.com/linux-download/, but I always get a timeout error when running start.sh. How can I solve it?

[ERROR]: GLFW error: 65537: The GLFW library is not initialized

[INFO ]: Service instance constructor
[ERROR]: <netman> netmanException level: 0, reason: Timeout: connect timed out: 138.68.105.98:443
[ERROR]: Handshake failed: Timeout
Comment 25 Topi Pohjolainen 2017-04-27 05:19:49 UTC
I never used any script; I just run it directly:

./gfxbench_gl-4.0.0/bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_4
Comment 26 Topi Pohjolainen 2017-04-27 05:25:02 UTC
Just to make certain we are all on the same page about where we are at the moment:

1) Originally the bug was filed against HSW. That got "fixed" by optimizations hiding the real bug. Presumably this is in vec4 and indirect loads. Unfortunately I haven't had time to look into it further.

2) Skylake had hangs due to missing stalls on the 3D <-> GPGPU switch. This has been fixed in upstream for a while now.

3) Eero: I'm getting the impression that there are still hangs with SKL?
Comment 27 Eero Tamminen 2017-04-27 08:08:19 UTC
Hangs had earlier been happening daily on SKL GT2 & BXT, on more than every other run of the benchmark, but:
* since 2017-03-20, I haven't seen any CarChase Offscreen hangs in SKL GT2 dmesgs (recently there are hangs from "rcs0", whatever that is)
* BXT still hangs, but since about 2017-03-13 they are rarer; at most 1 out of 3 runs hangs

These are Mesa changes, not kernel ones.
Comment 28 Eero Tamminen 2017-04-27 08:25:06 UTC
GLK A1 doesn't have any hangs either, unless one uses a couple of months old Mesa & kernel.  I.e. whatever is still causing the issue for BXT no longer affects GLK.
Comment 29 Mark Janes 2017-05-04 23:24:05 UTC
I can confirm that gpu hangs are easy to reproduce with car chase on BXT.
Comment 30 Topi Pohjolainen 2017-05-16 12:56:47 UTC
Using lab-bxt-antec, I ran:

bin/testfw_app --gfx glfw --gl_api desktop_core --width 1920 --height 1080 --fullscreen 1 --test_id gl_4_off

some 10+ times but couldn't see any hangs. I used the latest Mesa upstream: 0ca5bdb330d6b928c1320e5829906b195bd2c4b8.

Any tips for reproducing?
Comment 31 Topi Pohjolainen 2017-05-16 13:22:03 UTC
Onscreen doesn't seem to hang either.
Comment 32 Eero Tamminen 2017-05-16 15:30:12 UTC
Onscreen hasn't hung on any machine for a long time, only offscreen.

Offscreen was still hanging at least on Saturday:
GPU HANG: ecode 9:0:0x85dfffff, in testfw_app [2308], reason: Hang on rcs0, action: reset
...
rcs0 (submitted by testfw_app [2308], ctx 1 [7], score 0) --- gtt_offset = 0x00000000 feba9000

but I didn't see hangs yesterday.
Comment 33 Mark Janes 2017-05-17 04:48:27 UTC
Topi: which kernel are you using?
Comment 34 Topi Pohjolainen 2017-05-17 05:58:37 UTC
Yesterday it was using the CI nightly: 4.12.0-rc1. The head looks to be:
9b25870f9fa4548ec2bb40e42fa28f35db2189e1
Comment 35 Mark Janes 2017-05-17 22:52:37 UTC
Well, now I can't reproduce GPU hangs on BXT.  I tried offscreen as well.
Comment 36 Eero Tamminen 2017-07-07 11:42:23 UTC
I haven't seen any CarChase hangs in the last couple of weeks.

SKL GT2 has some "rcs0" hangs, and Vulkan programs nowadays hang on all SoC devices, but those aren't related to CarChase or the other GfxBench GL test cases.

I.e. it seems that this has been fixed (or at least hidden).  Topi?
Comment 37 Topi Pohjolainen 2017-07-10 14:40:56 UTC
I've been so busy with the "ISL on i965" work that I haven't had time to work on this. I'd still like to check a few things before closing this.
Comment 38 Eero Tamminen 2017-07-13 08:56:38 UTC
The only hangs that I have seen recently are for Manhattan 3.0 offscreen on BXT & GLK A1:
[ 2068.789578] [drm] GPU HANG: ecode 9:0:0x85dfffff, in testfw_app [2262], reason: Hang on rcs0, action: reset

And for Heaven (high quality, FullHD fullscreen, with tessellation, no AA) on SKL GT2:
[  736.775039] [drm] GPU HANG: ecode 9:0:0x84df3ec4, in heaven_x64 [1978], reason: Hang on rcs0, action: reset

On BXT B1 also Manhattan 3.1 hangs:
[ 1958.785740] [drm] GPU HANG: ecode 9:0:0x85dfffff, in testfw_app [2524], reason: Hang on rcs0, action: reset

(These are with the drm-tip kernel and a couple of months older one.)
Comment 39 Eero Tamminen 2017-08-14 16:32:07 UTC
(In reply to Topi Pohjolainen from comment #37)
> I've been that busy with "ISL on i965"-work that I haven't had time to work
> on this. I'd still like to check a few things before close this.

Topi, any updates on that?

(The new hangs I mentioned in my previous comment seem to have gone away too.)
Comment 40 Topi Pohjolainen 2017-08-15 11:55:22 UTC
(In reply to Eero Tamminen from comment #39)
> (In reply to Topi Pohjolainen from comment #37)
> > I've been that busy with "ISL on i965"-work that I haven't had time to work
> > on this. I'd still like to check a few things before close this.
> 
> Topi, any updates on that?

I still want to take a look. I'm just back from holidays, so it will take a few days.

> 
> (The new hangs I mentioned in my previous comment, seem to have gone away
> too.)
Comment 41 Topi Pohjolainen 2017-08-21 09:58:53 UTC
(In reply to Topi Pohjolainen from comment #21)
> Okay, disabling these two optimizations gives the gpu hangs on HSW even on
> top of current upstream:
> 
> diff --git a/src/intel/compiler/brw_nir.c b/src/intel/compiler/brw_nir.c
> index f863085..5d83ce3 100644
> --- a/src/intel/compiler/brw_nir.c
> +++ b/src/intel/compiler/brw_nir.c
> @@ -470,7 +470,6 @@ nir_optimize(nir_shader *nir, const struct brw_compiler
> *compiler,
>     do {
>        progress = false;
>        OPT_V(nir_lower_vars_to_ssa);
> -      OPT(nir_opt_copy_prop_vars);
>  
>        if (is_scalar) {
>           OPT(nir_lower_alu_to_scalar);
> @@ -498,9 +497,6 @@ nir_optimize(nir_shader *nir, const struct brw_compiler
> *compiler,
>           OPT(nir_opt_dce);
>        }
>        OPT(nir_opt_if);
> -      if (nir->options->max_unroll_iterations != 0) {
> -         OPT(nir_opt_loop_unroll, indirect_mask);
> -      }
>        OPT(nir_opt_remove_phis);
>        OPT(nir_opt_undef);
>        OPT_V(nir_lower_doubles, nir_lower_drcp |

Using this on top of current upstream gave me a hang on IVB too, already on the 3rd try. This lets me debug locally on my laptop.
Comment 42 Eero Tamminen 2017-08-21 10:45:59 UTC
Topi, bug 102289, which Jason just fixed, may be related to this.
Comment 43 Topi Pohjolainen 2017-08-21 13:01:07 UTC
I'll give it a go.

Meanwhile, I'll note some of my findings here. The first hang I got was actually in the blit ring (coming from readpixels()). I simply made the actual blit a no-op to see if I get hangs similar to those I saw on HSW.

What I got looks very different; it seems that the batch buffer contains garbage (0x423303d0 doesn't look like a valid op-code at all):

batchbuffer (render ring (submitted by testfw_app [3096])) at 0x1aa4b000
0x1aa4b000:      0x423303d0: 2D UNKNOWN
0x1aa4b004:      0xc2a90ba2: UNKNOWN
0x1aa4b008:      0x43164f2a: 2D UNKNOWN
0x1aa4b00c:      0x43168086: 2D UNKNOWN
0x1aa4b010:      0x423303d0: 2D UNKNOWN
0x1aa4b014:      0xc2a90ba2: UNKNOWN
0x1aa4b018:      0x43164f2a: 2D UNKNOWN
0x1aa4b01c:      0x43168086: 2D UNKNOWN
0x1aa4b020:      0x42393ff0: 2D UNKNOWN
0x1aa4b024:      0xc2b2caae: UNKNOWN
0x1aa4b028:      0x43199be2: 2D UNKNOWN
0x1aa4b02c:      0x4319cd34: 2D UNKNOWN
0x1aa4b030:      0x42393ff0: 2D UNKNOWN
0x1aa4b034:      0xc2b2caae: UNKNOWN
0x1aa4b038:      0x43199be2: 2D UNKNOWN
0x1aa4b03c:      0x4319cd34: 2D UNKNOWN
0x1aa4b040:      0x422f8890: 2D UNKNOWN
0x1aa4b044:      0xc2a6efdd: UNKNOWN
0x1aa4b048:      0x43189bbb: 2D UNKNOWN
0x1aa4b04c:      0x4318cd10: 2D UNKNOWN
0x1aa4b050:      0x422f8890: 2D UNKNOWN
0x1aa4b054:      0xc2a6efdd: UNKNOWN
0x1aa4b058:      0x43189bbb: 2D UNKNOWN
0x1aa4b05c:      0x4318cd10: 2D UNKNOWN
0x1aa4b060:      0xc1208f4b: UNKNOWN
0x1aa4b064:      0xc132a236: UNKNOWN
0x1aa4b068:      0x431b235e: 2D UNKNOWN
0x1aa4b06c:      0x431b54aa: 2D UNKNOWN
0x1aa4b070:      0x00000000: MI_NOOP
0x1aa4b074:      0x7a000003: PIPE_CONTROL
Comment 44 Topi Pohjolainen 2017-08-21 13:11:19 UTC
(In reply to Topi Pohjolainen from comment #43)
> I'll give it a go.

Unfortunately, I got the same hang with Jason's fix (just rebased to current upstream). I'll try to reproduce with INTEL_DEBUG=bat next to see what Mesa thinks it is submitting.
Comment 45 Topi Pohjolainen 2017-08-24 12:57:37 UTC
Dumping batches seems to change the runtime dynamics too much and the hang just won't reoccur. I even tried a more lightweight version of batch dumping which just checks that each instruction is valid. Both require mapping and unmapping the underlying buffer object, which might prevent/hide the hang from reappearing within a reasonable amount of time.
Comment 46 Eero Tamminen 2017-08-24 13:19:03 UTC
If it's timing-sensitive, maybe the recent Valley hangs are related:
https://bugs.freedesktop.org/show_bug.cgi?id=102286
?
Comment 47 Topi Pohjolainen 2017-08-24 13:30:44 UTC
(In reply to Topi Pohjolainen from comment #45)
> Dumping batches seems to change the runtime dynamics too much and the hang
> just won't reoccur. I even tried a more light-weight version of batch
> dumping which just checks that instruction is valid. Both require mapping
> and un-mapping of the underlying buffer object which might prevent/hide the
> hang from reappearing within reasonable amount time.

Checking instructions before flushing allows one to read the batch through the mapping used for batch emission. With that I was actually able to catch user space writing an invalid instruction. I'm checking that now.
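
A sketch of the kind of in-place check described here, assuming the batch is read back through the same CPU mapping used for emission; the function names and the crude command-type test are assumptions, not the actual debug patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* On these GPUs the command type lives in bits 31:29 of a command header:
 * 0 = MI, 2 = 2D/BLT, 3 = GFXPIPE (3D/media).  Payload dwords can hold any
 * value, so only known header positions can be tested this way; a fuller
 * check would walk the batch by decoding each instruction's length.
 */
static bool
plausible_command_header(uint32_t dw)
{
   switch (dw >> 29) {
   case 0:   /* MI commands, including 0x00000000 == MI_NOOP */
   case 2:   /* 2D / BLT commands */
   case 3:   /* GFXPIPE commands, e.g. 0x7a000003 == PIPE_CONTROL */
      return true;
   default:
      return false;
   }
}

/* Called just before submission; 'map' is the CPU mapping the driver
 * emitted commands through, so no extra map/unmap is needed.
 */
static void
check_batch_start(const uint32_t *map)
{
   if (!plausible_command_header(map[0]))
      fprintf(stderr, "batch starts with invalid dword 0x%08x\n", map[0]);
}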
Comment 48 Topi Pohjolainen 2017-08-25 09:36:41 UTC
While I attempted to catch the invalid instruction earlier under a debugger, I coincidentally managed to get:

#17 0x00007ffff650a415 in do_flush_locked (out_fence_fd=<optimized out>, 
    in_fence_fd=<optimized out>, brw=<optimized out>)
    at intel_batchbuffer.c:702
#18 _intel_batchbuffer_flush_fence (brw=<optimized out>, 
    in_fence_fd=<optimized out>, out_fence_fd=<optimized out>, 
    file=<optimized out>, line=<optimized out>) at intel_batchbuffer.c:789
#19 0x00007ffff650a519 in intel_batchbuffer_require_space (
    brw=brw@entry=0x7ffff60cb040, sz=32, ring=ring@entry=BLT_RING)
    at intel_batchbuffer.c:237
#20 0x00007ffff650b8d1 in intelEmitCopyBlit (brw=brw@entry=0x7ffff60cb040, 
    cpp=cpp@entry=1, src_pitch=src_pitch@entry=384, 
    src_buffer=src_buffer@entry=0x7ffff000f6a0, src_offset=src_offset@entry=0, 
    src_tiling=src_tiling@entry=ISL_TILING_LINEAR, dst_pitch=384, 
    dst_buffer=0x7fffdc5cdcd0, dst_offset=0, dst_tiling=ISL_TILING_LINEAR, 
    src_x=0, src_y=0, dst_x=0, dst_y=0, w=384, h=1, logic_op=5379)
    at intel_blit.c:527
#21 0x00007ffff650d580 in intel_emit_linear_blit (
    brw=brw@entry=0x7ffff60cb040, dst_bo=0x7fffdc5cdcd0, 
    dst_offset=dst_offset@entry=0, src_bo=src_bo@entry=0x7ffff000f6a0, 
    src_offset=src_offset@entry=0, size=size@entry=384) at intel_blit.c:729
#22 0x00007ffff650e00f in brw_buffer_subdata (ctx=0x7ffff60cb040, offset=0, 
    size=384, data=0x7fffdc5dc810, obj=0x7ffff21988d0)
    at intel_buffer_objects.c:297
#23 0x00007ffff6c9a9c7 in ?? ()
   from /home/tpohjola/work/benchmarks/gfxbench_gl-4.0.0/plugins/libgfxbench40_gl.so


In do_flush_locked() the call to execbuffer(), via drmIoctl(), resulted in errno 5 (I/O error).
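
For context, a hypothetical, simplified submission wrapper showing where such an error surfaces; the retry-on-EINTR/EAGAIN loop mirrors what drmIoctl() does, and the names here are placeholders:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>

/* Retry the execbuffer ioctl on EINTR/EAGAIN and report any other failure,
 * such as the errno 5 (EIO) seen above, which the kernel can return e.g.
 * after it has given up on a context that hung repeatedly.
 */
static int
submit_execbuffer(int drm_fd, unsigned long request, void *execbuf)
{
   int ret;

   do {
      ret = ioctl(drm_fd, request, execbuf);
   } while (ret == -1 && (errno == EINTR || errno == EAGAIN));

   if (ret == -1)
      fprintf(stderr, "execbuffer failed: %s (errno %d)\n",
              strerror(errno), errno);

   return ret;
}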
Comment 49 Topi Pohjolainen 2017-08-25 09:46:00 UTC
(In reply to Topi Pohjolainen from comment #48)
> While I attempted to catch the invalid instruction earlier under debugger I
> co-incidentally managed to get:
> 
> #17 0x00007ffff650a415 in do_flush_locked (out_fence_fd=<optimized out>, 
>     in_fence_fd=<optimized out>, brw=<optimized out>)
>     at intel_batchbuffer.c:702
> #18 _intel_batchbuffer_flush_fence (brw=<optimized out>, 
>     in_fence_fd=<optimized out>, out_fence_fd=<optimized out>, 
>     file=<optimized out>, line=<optimized out>) at intel_batchbuffer.c:789
> #19 0x00007ffff650a519 in intel_batchbuffer_require_space (
>     brw=brw@entry=0x7ffff60cb040, sz=32, ring=ring@entry=BLT_RING)
>     at intel_batchbuffer.c:237
> #20 0x00007ffff650b8d1 in intelEmitCopyBlit (brw=brw@entry=0x7ffff60cb040, 
>     cpp=cpp@entry=1, src_pitch=src_pitch@entry=384, 
>     src_buffer=src_buffer@entry=0x7ffff000f6a0,
> src_offset=src_offset@entry=0, 
>     src_tiling=src_tiling@entry=ISL_TILING_LINEAR, dst_pitch=384, 
>     dst_buffer=0x7fffdc5cdcd0, dst_offset=0, dst_tiling=ISL_TILING_LINEAR, 
>     src_x=0, src_y=0, dst_x=0, dst_y=0, w=384, h=1, logic_op=5379)
>     at intel_blit.c:527
> #21 0x00007ffff650d580 in intel_emit_linear_blit (
>     brw=brw@entry=0x7ffff60cb040, dst_bo=0x7fffdc5cdcd0, 
>     dst_offset=dst_offset@entry=0, src_bo=src_bo@entry=0x7ffff000f6a0, 
>     src_offset=src_offset@entry=0, size=size@entry=384) at intel_blit.c:729
> #22 0x00007ffff650e00f in brw_buffer_subdata (ctx=0x7ffff60cb040, offset=0, 
>     size=384, data=0x7fffdc5dc810, obj=0x7ffff21988d0)
>     at intel_buffer_objects.c:297
> #23 0x00007ffff6c9a9c7 in ?? ()
>    from
> /home/tpohjola/work/benchmarks/gfxbench_gl-4.0.0/plugins/libgfxbench40_gl.so
> 
> 
> In do_flush_locked() one called execbuffer() which via drmIoctl() resulted
> into errno 5 (I/O error).

There was also a GPU hang which might be something other than what I was expecting to see (an invalid instruction). Unfortunately I didn't get an error dump for this, as I forgot to empty the error dump file.
Comment 50 Topi Pohjolainen 2017-08-28 12:52:19 UTC
Just as a status update, I have tried adding unreachable("") exits and checks for the first dword in the batch buffer in various places, according to the possible explanations I could come up with. So far no luck (well, I have ruled out some scenarios, I suppose). Moreover, the hangs are not that reproducible: sometimes 3-4 runs are enough, but sometimes even 20 rounds just pass merrily (I've been using an X1 Carbon, which also heats up pretty badly - it starts to smell).
Comment 51 Topi Pohjolainen 2017-08-31 17:03:05 UTC
For quite some time I only saw _intel_batchbuffer_flush_fence() when switching from the render ring to the blit ring, and specifically because the following condition fired:

   /* If we're switching rings, implicitly flush the batch. */
   if (unlikely(ring != brw->batch.ring) && brw->batch.ring != UNKNOWN_RING &&
       brw->gen >= 6)

I thought this was for some reason significant. But now I've gotten the hang also when flushing due to running out of batch space, and also from the render ring itself (brw_try_draw_prims()).

What is also curious is that the error decode shows the batch buffer containing garbage starting from the beginning, but both brw->batch->bo->map_cpu and brw->batch->last_bo->map_cpu give me sane values (0x7a000003 and 0x54c00006, respectively). The error decode says the first dword is 0x423303d0.
Comment 52 Topi Pohjolainen 2017-08-31 17:04:09 UTC
(In reply to Topi Pohjolainen from comment #51)
> For quite some time I only saw _intel_batchbuffer_flush_fence() when

Meant to read "...saw _intel_batchbuffer_flush_fence() failing".
Comment 53 Topi Pohjolainen 2017-09-01 14:33:19 UTC
I wanted to take a closer look at things in the kernel and hence compiled drm-intel/master. That gives me a lot less "garbage", and it is well inside the batch:

0x19c7bc20:  0xc1c7182d : Dword 4
    Dispatch GRF Start Register For URB Data: 28
    Vertex URB Entry Read Length: 35
    Vertex URB Entry Read Offset: 2
0x19c7bc24:  0x409e289e : Dword 5
    Maximum Number of Threads: 32
    Statistics Enable: false
    Vertex Cache Disable: true
    Enable: false
unknown instruction 41c446cd
0x19c7bf64:  0x00000000:  MI_NOOP                                                
    Identification Number Register Write Enable: false
    Identification Number: 0
0x19c7bf68:  0x00000000:  MI_NOOP                                                
    Identification Number Register Write Enable: false
    Identification Number: 0
0x19c7bf6c:  0x00000000:  MI_NOOP                                                
    Identification Number Register Write Enable: false
    Identification Number: 0
0x19c7bf70:  0x781f000c:  3DSTATE_SBE
Comment 54 Topi Pohjolainen 2017-09-26 17:14:17 UTC
Interesting: CarChase also uses a fragment shader which doesn't have any inputs, uniforms or varyings. Hence on SKL and BXT this might have been a cause, even though it apparently doesn't show up with current upstream:

https://lists.freedesktop.org/archives/mesa-dev/2017-September/170531.html
Comment 55 Topi Pohjolainen 2017-09-28 15:59:02 UTC
The hangs with IVB and HSW are really troublesome to debug. With my own IVB the reproducibility of the bug changes a lot - sometimes I get it within a few rounds and sometimes it runs 20 rounds without a hang. This makes it very difficult to try things out, as one can't tell for certain if something has an effect - how many rounds are really needed to tell for certain?

For this reason I started using one of the machines in the lab - lab_hsw_plex. With that I couldn't get a hang simply by reverting the two compiler optimizations discussed earlier. Going back in the commit history to the introduction of the older of the two optimizations gave me a hang - after some 15 rounds. But since then I haven't been able to get it any more. Moreover, the signature of the hang pointed to a depth clear, which is different from before. Hence I'm guessing I just hit another bug whose fix I missed because I moved too far back in the history.

All in all, this is starting to look like a futile effort, considering that none of the hangs, either on gen7 or later, can be reproduced with current upstream. I'm going to give this exercise one more day before my holidays. If nothing comes out of it, I think we should probably close this.
Comment 56 Eero Tamminen 2017-09-28 16:16:26 UTC
(In reply to Topi Pohjolainen from comment #55)
> All in all this starts to look like a futile effort considering that none of
> the hangs, either on gen7 or later can't be reproduced with current
> upstream.

If you have a BSW, you could take a quick look at bug 101406 (CarChase misrendering) in case that has some relation to whatever triggers the hangs (I haven't seen CarChase hangs on BSW with this year's Mesa, though).


> I'm going to give one more day for this exercise before my
> holidays. If nothing comes out of it, I think we should probably close this.

I agree.

Btw, I think the BDW GfxBench T-Rex hang, bug 102286, is nowadays more interesting. Although it happens quite rarely, it at least happens with newer Mesa without needing to disable anything.
Comment 57 Topi Pohjolainen 2017-10-24 07:22:07 UTC
I don't think there is more to be done here. Eero, do you mind closing this?
Comment 58 Eero Tamminen 2017-10-24 08:30:52 UTC
Haven't seen any hangs in the last couple of months, so closing as fixed.
