Bug 111385

Summary: (Only partly recoverable) GPU hangs in (multi-context) SynMark HDRBloom & Multithread tests with Iris driver
Product: DRI
Component: DRM/Intel
Status: RESOLVED MOVED
Severity: critical
Priority: high
Version: DRI git
Keywords: regression
Hardware: x86-64 (AMD64)
OS: All
Reporter: Eero Tamminen <eero.t.tamminen>
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: chris, francesco.balestrieri, intel-gfx-bugs, leho
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=111384
          https://bugs.freedesktop.org/show_bug.cgi?id=111424
          https://bugs.freedesktop.org/show_bug.cgi?id=111748
          https://bugs.freedesktop.org/show_bug.cgi?id=111782
          https://bugs.freedesktop.org/show_bug.cgi?id=111936
Whiteboard: ReadyForDev
i915 platform: BDW, ICL, KBL, SKL
i915 features: GPU hang
Attachments (all with flags: none):
- SKL GT2 (recoverable) GPU hang error state
- SKL GT4e (recoverable) GPU hang with i965
- KBL GT3e (recoverable) GPU hang
- SkullCanyon GPU hang error state from boot + running HdrBloom 10x with Iris
- ICL-U D1 (recoverable) GPU hang error state

Description Eero Tamminen 2019-08-12 14:25:18 UTC
Created attachment 145037 [details]
SKL GT2 (recoverable) GPU hang error state

Setup:
- SKL i5 6600K
- Ubuntu 16.04
- drm-tip git kernel (0330b51e91)
- Mesa git (5ed4e31c08d)
- Unity desktop

Test-case:
- 3x fullscreen FullHD HDRBloom multi-context SynMark test-case:
  synmark2 OglHdrBloom

Actual outcome:
- Recoverable GPU hang, but all successive GL tests fail after that:
  i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
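
For reference, a minimal sketch of the reproduction loop, assuming synmark2 is on PATH and that the hang shows up in dmesg as quoted above (the fullscreen FullHD settings come from the local SynMark configuration, not from any flags shown here):
-----------------------------------------------------------------
#!/bin/sh
# Run the multi-context HdrBloom test three times in a row and
# check the kernel log for the GPU hang reported above.
for i in 1 2 3; do
    echo "Iteration $i/3: synmark2 OglHdrBloom"
    synmark2 OglHdrBloom
done
dmesg | grep "GPU HANG" && echo "Hang reproduced"
-----------------------------------------------------------------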

The main difference between this test and most other tests is that it uses multiple contexts; only a few of the tests do.

I haven't seen such hangs with the i965 driver.

I wasn't able to reproduce the hang after reboot when re-running HDRBloom 10 times, so it may depend on previous tests, or it is just very hard to reproduce.

I didn't see such a hang when running similar test sets a month ago, so it may be a regression.

On BXT there was a hang in a different test-case.
Comment 1 Eero Tamminen 2019-08-12 16:09:57 UTC
(In reply to Eero Tamminen from comment #0)
> I wasn't able to reproduce the hang after reboot when re-running HDRBloom 10
> times, so it may depend on previous tests, or is just very hard to reproduce.

I was able to reproduce the GPU hang with Iris by running each SynMark test 3x in alphabetical order.  At HdrBloom there was again a GPU hang.  This time the tests after HdrBloom didn't fail.
Comment 2 Eero Tamminen 2019-08-19 08:21:00 UTC
Got a (non-recoverable) HdrBloom hang also with i965 when using the latest Git gfx stack, on SKL GT4e (SkullCanyon), so this might not be an Iris-specific issue.
Comment 3 Eero Tamminen 2019-08-19 10:40:46 UTC
Created attachment 145096 [details]
SKL GT4e (recoverable) GPU hang with i965

In the SKL GT4e / i965 case, the recoverable GPU hang during the HdrBloom run appears to happen in the X server.

If the same was also true of the SKL GT2 case (the i915 error state didn't specify the process), then that hang was also in i965, as only the benchmark itself was running with Iris on SKL GT2.
Comment 4 Eero Tamminen 2019-08-19 10:50:49 UTC
Last night the HdrBloom test had a (non-recoverable) GPU hang (something had at least broken the test automation network connection during that exact test).  Moving to i965, as Iris isn't yet enabled by default and this happens (also) with i965.
Comment 5 Eero Tamminen 2019-08-26 10:39:02 UTC
Still getting these.  The following is with i965 used for the desktop and Iris used for the benchmark:

[ 7275.093815] Iteration 1/3: synmark2 OglHdrBloom
[ 7284.938454] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[ 7284.938458] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 7284.938459] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 7284.938460] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 7284.938461] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 7284.938463] GPU crash dump saved to /sys/class/drm/card0/error
[ 7284.939473] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 7284.940243] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 7284.940583] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 7284.942364] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 7284.943124] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 7292.938632] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 7300.938440] i915 0000:00:02.0: Resetting rcs0 for stuck wait on rcs0
[ 7340.115090] Iteration 2/3: synmark2 OglHdrBloom
[ 7405.136485] Iteration 3/3: synmark2 OglHdrBloom
[ 7436.938401] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 7452.939387] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 7452.940156] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 7452.940246] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 7452.942015] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 7452.942773] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 7460.938388] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 7470.157140] Iteration 1/3: synmark2 OglMultithread
[ 7470.922402] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 7478.922401] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 7486.923396] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Comment 6 Eero Tamminen 2019-08-28 10:21:19 UTC
I'm definitely seeing more of these with Iris than with i965, again on both SKL GT2 & GT4e.

In the SKL GT2 case, all the benchmarks following HdrBloom also failed (there just weren't additional GPU hangs) => it seems the i915 GPU state got corrupted.

Just tell me if you need more error states.
Comment 7 Eero Tamminen 2019-08-29 09:46:48 UTC
Created attachment 145199 [details]
KBL GT3e (recoverable) GPU hang

This happens also on KBL, error state attached.
Comment 8 Eero Tamminen 2019-08-30 12:17:28 UTC
Moving back to Iris: I haven't seen this on i965 since, but it happens every time with Iris.  Fairly often the other tests after this failure also fail.
Comment 9 Eero Tamminen 2019-09-03 10:50:40 UTC
This time SKL GT4e had a recoverable GPU hang in the SynMark (CPU<->GPU sync) Terrain tests, instead of in HdrBloom like SKL GT2 & KBL GT3e did.

The KBL GT3e recoverable GPU hang dmesg shows an additional issue:
----------------------------------------------------------------
[ 1799.952461] Iteration 1/3: synmark2 OglHdrBloom
[ 1822.876411] i915 0000:00:02.0: GPU HANG: ecode 9:1:0x00000000, hang on rcs0
[ 1822.876414] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1822.876415] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1822.876416] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1822.876417] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[ 1822.876418] GPU crash dump saved to /sys/class/drm/card0/error
[ 1822.877427] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 1822.878178] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 1822.878298] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 1822.880074] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 1822.880814] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 1830.876391] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 1838.876410] i915 0000:00:02.0: Resetting rcs0 for stuck wait on rcs0
[ 1838.993330] Iteration 2/3: synmark2 OglHdrBloom
[ 1856.861369] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 1856.862122] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 1856.862206] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 1856.863980] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 1856.864751] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 1864.860345] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 1872.861361] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 1872.971858] Iteration 3/3: synmark2 OglHdrBloom
[ 1898.845367] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 1898.846124] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 1898.846214] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 1898.847988] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 1898.848741] [drm:gen8_reset_engines [i915]] *ERROR* rcs0 reset request timed out: {request: 00000001, RESET_CTL: 00000001}
[ 1906.844355] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[ 1916.892378] i915 0000:00:02.0: Resetting rcs0 for stuck wait on rcs0
[ 1916.895299] ------------[ cut here ]------------
[ 1916.895392] WARNING: CPU: 2 PID: 0 at ./include/linux/dma-fence.h:532 i915_request_skip+0xa8/0xc0 [i915]
[ 1916.895393] Modules linked in: fuse i915 nfs lockd grace overlay x86_pkg_temp_thermal coretemp crct10dif_pclmul mei_me e1000e crc32_pclmul mei sunrpc
[ 1916.895405] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.3.0-rc6-CI-Nightly_1860+ #1
[ 1916.895407] Hardware name:  /NUC7i7BNB, BIOS BNKBL357.86A.0062.2018.0222.1644 02/22/2018
[ 1916.895466] RIP: 0010:i915_request_skip+0xa8/0xc0 [i915]
[ 1916.895469] Code: eb c7 48 c7 c7 40 b8 2d a0 89 74 24 04 e8 5e e8 ee e0 0f 0b 8b 74 24 04 eb 93 48 c7 c7 40 b8 2d a0 89 74 24 04 e8 46 e8 ee e0 <0f> 0b 8b 74 24 04 e9 6b ff ff ff 0f 1f 00 66 2e 0f 1f 84 00 00 00
[ 1916.895471] RSP: 0018:ffffc9000011ce48 EFLAGS: 00010086
[ 1916.895473] RAX: 0000000000000024 RBX: ffff88820241c6c0 RCX: 0000000000000103
[ 1916.895475] RDX: 0000000000000000 RSI: ffff888276b163d8 RDI: 00000000ffffffff
[ 1916.895477] RBP: ffffc90031050000 R08: 00000000000002e3 R09: 0000000000000004
[ 1916.895478] R10: ffffc9000011ced8 R11: 0000000000000001 R12: ffff88826796d2c0
[ 1916.895480] R13: ffff8882745da000 R14: ffff888241d02ac0 R15: ffff888241d00fc0
[ 1916.895482] FS:  0000000000000000(0000) GS:ffff888276b00000(0000) knlGS:0000000000000000
[ 1916.895484] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1916.895485] CR2: 0000558f16d1e848 CR3: 000000000340a004 CR4: 00000000003606e0
[ 1916.895487] Call Trace:
[ 1916.895491]  <IRQ>
[ 1916.895546]  __i915_request_submit+0x11d/0x150 [i915]
[ 1916.895595]  execlists_dequeue+0x60c/0xda0 [i915]
[ 1916.895641]  execlists_submission_tasklet+0x59/0x60 [i915]
[ 1916.895648]  tasklet_action_common.isra.4+0x3d/0xa0
[ 1916.895654]  __do_softirq+0xf7/0x34b
[ 1916.895659]  irq_exit+0x98/0xb0
[ 1916.895663]  smp_apic_timer_interrupt+0x8e/0x190
[ 1916.895666]  apic_timer_interrupt+0xf/0x20
[ 1916.895668]  </IRQ>
[ 1916.895673] RIP: 0010:cpuidle_enter_state+0xae/0x450
[ 1916.895676] Code: 49 89 c4 0f 1f 44 00 00 31 ff e8 2d a5 8f ff 45 84 f6 74 12 9c 58 f6 c4 02 0f 85 73 03 00 00 31 ff e8 c6 a4 94 ff fb 45 85 ed <0f> 88 c9 02 00 00 4c 2b 24 24 48 ba cf f7 53 e3 a5 9b c4 20 49 63
[ 1916.895678] RSP: 0018:ffffc900000b7e80 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 1916.895680] RAX: ffff888276b00000 RBX: ffffffff824b4340 RCX: 000000000000001f
[ 1916.895682] RDX: 000001be4fdc7708 RSI: 000000002487924d RDI: 0000000000000000
[ 1916.895684] RBP: ffff888276b31e00 R08: 0000000000000002 R09: 0000000000028dc0
[ 1916.895685] R10: ffffc900000b7e60 R11: 000000000000024a R12: 000001be4fdc7708
[ 1916.895687] R13: 0000000000000004 R14: 0000000000000000 R15: 0000000000000004
[ 1916.895694]  cpuidle_enter+0x29/0x40
[ 1916.895698]  do_idle+0x1e9/0x240
[ 1916.895702]  cpu_startup_entry+0x19/0x20
[ 1916.895705]  start_secondary+0x159/0x1a0
[ 1916.895709]  secondary_startup_64+0xa4/0xb0
[ 1916.895767] WARNING: CPU: 2 PID: 0 at ./include/linux/dma-fence.h:532 i915_request_skip+0xa8/0xc0 [i915]
[ 1916.895769] ---[ end trace 0ec701c0ac1a866b ]---
----------------------------------------------------------------
Comment 10 Mark Janes 2019-09-05 22:50:54 UTC
I tried to reproduce this, and failed.  Can you see if it reproduces with a stock kernel?
Comment 11 Eero Tamminen 2019-09-06 08:26:13 UTC
Created attachment 145272 [details]
SkullCanyon GPU hang error state from boot + running HdrBloom 10x with Iris
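
For reference, a minimal sketch of how such an error state is collected, assuming the sysfs path from the dmesg output quoted above (card0 on this setup) and that writing to the file clears the captured dump; the output filename is just an example:
-----------------------------------------------------------------
# Save the GPU crash dump referenced by dmesg, then clear it so
# the next hang can be captured (run as root).
cat /sys/class/drm/card0/error > hdrbloom-gpu-hang-error-state.txt
echo 1 > /sys/class/drm/card0/error
-----------------------------------------------------------------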
Comment 12 Eero Tamminen 2019-09-06 11:54:45 UTC
Last night there was a system hang in HdrBloom with Iris on BDW GT2 when using the latest Mesa and drm-tip kernel git versions -> the issue isn't GEN9(+) specific.


(In reply to Mark Janes from comment #10)
> I tried to reproduce this, and failed.
> Can you see if it reproduces with a stock kernel?

I'm not sure what you mean by "stock" kernel.

I tested with the Ubuntu 18.04 HWE kernel 5.0.0 on SkullCanyon, and wasn't able to reproduce the GPU hang with 40x repeats, so it seems to require a newer kernel.


drm-tip kernel bisect:

* v5.1 (from early May), with rest being latest: not able to reproduce within 20x rounds

* v5.2-rc3 (from early June): not able to reproduce within 20x rounds

* v5.2 (1804 from early July): hang within 5x rounds

* v5.2-rc6 (1790): hang within 5x rounds

* v5.2-rc4 (1780): not able to reproduce within 20x rounds

* v5.2-rc5 (1785): not able to reproduce within 20x rounds

* v5.2-rc5 (1788): hang within 15x rounds

* v5.2-rc5 (1786): not able to reproduce within 20x rounds (+ 5x Multithread)

-> drm-tip v5.2-rc5 or newer kernel needed to reproduce

(Numbers in parentheses above are our build IDs; build 1787 was for some reason rc4 with a week earlier commit date than expected.)
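
The per-build check behind the "hang within N rounds" / "not able to reproduce within 20x rounds" classification above boils down to a loop like this sketch (assuming root for clearing the kernel log, and the same "GPU HANG" dmesg marker quoted earlier):
-----------------------------------------------------------------
#!/bin/sh
# Classify the currently booted kernel build: "bad" if any of up
# to 20 HdrBloom rounds triggers a GPU hang, otherwise "good".
dmesg -C    # start from a clean kernel log (needs root)
for i in $(seq 1 20); do
    synmark2 OglHdrBloom
    if dmesg | grep -q "GPU HANG"; then
        echo "bad: GPU hang on round $i"
        exit 1
    fi
done
echo "good: no GPU hang within 20 rounds"
-----------------------------------------------------------------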


Bisect rest:

* On drm-tip v5.2 & Iris from early July: hang within 5x rounds

-> modifier support & newer Iris changes not needed for triggering

* On whole gfx stack from early July: hang within 10x rounds

* On whole gfx stack from early July, but Iris from early May: hang within 10x rounds

* On drm-tip v5.2, with the rest of the stack from early May: hang within 10x rounds

-> I.e. this is actually a drm-tip/kernel regression, not a Mesa one (as already hinted at by recovery sometimes failing). Moving to the DRM component.

Hangs seem to have started somewhere between the following drm-tip commits:
* 1180972dbd2a00f60a4d707772bd7e7ae6732ed5 drm-tip: 2019y-06m-20d-15h-39m-16s UTC integration manifest
* 7ff7b7a9d09acaa647921780fa5ed3525ab8f278 drm-tip: 2019y-06m-21d-23h-53m-21s UTC integration manifest


With the latest gfx stack, the hang seems to be somewhat more likely than with a user-space gfx stack from a few months ago.


(In reply to Eero Tamminen from comment #2)
> Got a (non-recoverable) HdrBloom hang also with i965 when using latest Git
> gfx stack, on SKL GT4e (SkullCanyon), so this might not be Iris specific
> issue.

This was the only time it happened with HdrBloom using the i965 driver, and the only time when the i915 error state specified in which process (X) the hang happened while HdrBloom was running.

I'm not able to reproduce the hang on SkullCanyon with v5.2 drm-tip using i965, with 25x rounds.  No idea why Iris triggers this bug so easily but i965 doesn't.  Does i915 rely on i965 re/setting some extra state?
Comment 13 Eero Tamminen 2019-09-06 11:59:36 UTC
So far this has happened on all the Core devices I have (BDW, SKL, KBL), but I haven't yet seen it on (BXT) Atoms.  This may be just luck due to device speed.
Comment 14 Eero Tamminen 2019-09-10 11:32:57 UTC
This happens very frequently in the nightly testing, ruins results for the following 3D tests, and slows things down so much that testing often times out -> raising to critical
Comment 15 Lakshmi 2019-09-11 08:27:00 UTC
(In reply to Eero Tamminen from comment #13)
> So far this has happened now on all Core devices I have (BDW, SKL, KBL), but
> I haven't yet seen it on (BXT) Atoms.  This may be just luck due to device
> speed.

(In reply to Eero Tamminen from comment #14)
> This happens very frequently in the nightly testing, ruins results for
> following 3D tests and slows them down things so much that testing often
> times out -> raising to critical

Setting the priority to highest, considering that this issue is a regression and considering its impact.
Comment 16 Eero Tamminen 2019-09-16 08:09:51 UTC
(In reply to Eero Tamminen from comment #14)
> This happens very frequently in the nightly testing, ruins results for
> following 3D tests and slows them down things so much that testing often
> times out -> raising to critical

Later tests have started to succeed; only SynMark HdrBloom, and sometimes the SynMark Multithread test following it, get (recoverable) GPU hangs. No idea whether this improvement is due to a change in drm-tip, or in Mesa, or just chance.
Comment 17 Eero Tamminen 2019-09-17 10:23:10 UTC
Has anybody tried investigating this highest-priority/critical drm-tip regression with Iris (introduced in v5.2-rc5)?  As can be seen from comment 12, it's easy to reproduce.

Chris?
Comment 18 Eero Tamminen 2019-09-18 09:44:36 UTC
(In reply to Eero Tamminen from comment #16)
> Later tests have started to succeed, only SynMark HdrBloom, and sometimes
> SynMark Multithread test following it, get (recoverable) GPU hangs. No idea
> whether this improvement is due to change in drm-tip, or Mesa, or just
> chance.

It was chance.  All GPU tests (both 3D & media) are again failing after HdrBloom GPU hangs (on SKL GT2 & GT4e, KBL GT3e looks better).
Comment 19 Francesco Balestrieri 2019-09-19 13:05:19 UTC
Moving to high since it works on i965 and Mark couldn't reproduce in Mesa CI.
Comment 20 Eero Tamminen 2019-09-19 13:39:27 UTC
(In reply to Francesco Balestrieri from comment #19)
> Moving to high since it works on i965

I've gotten this once also with i965 (not daily like with Iris), see comment 2.


> and Mark couldn't reproduce in Mesa CI.

Mark is using distro kernels in Mesa CI, not drm-tip, i.e. he can't reproduce this because the Mesa CI kernel version is too old; it's from before the regression.  Mesa CI would hit it if they updated to a new enough kernel.

(This was first thought to be a Mesa issue, not a kernel one.)
Comment 21 Eero Tamminen 2019-09-20 09:31:29 UTC
(In reply to Eero Tamminen from comment #20)
> (In reply to Francesco Balestrieri from comment #19)
> > Moving to high since it works on i965
> 
> I've gotten this once also with i965 (not daily like with Iris), see comment 2

Note also that Iris is already the default driver on GEN11+, and the Mesa team is planning to make it the default driver for GEN8+ before the end of the year (in the Mesa 19.3 release).

(If you loan us ICL HW, I can run tests to check which of the GPU resets and system hangs I've reported against GEN9 also happen on ICL.)
Comment 22 Francesco Balestrieri 2019-09-20 11:36:33 UTC
Sure, it's still a high priority bug, we just have bigger fires at the moment...
Comment 23 Eero Tamminen 2019-09-20 15:03:39 UTC
This happens also on ICL, so I assume it can be triggered on all Iris-supported (=GEN8+) Core platforms.

I can't attach an error state for ICL, because the HdrBloom and the rest of the hangs happened after there had been a (recoverable) GPU hang in SynMark CSDof (a compute / register-spill case).  I'll try to get one next week from just running HdrBloom on ICL (if I still have that machine on loan).
Comment 24 Francesco Balestrieri 2019-09-27 05:47:06 UTC
Given that this used to work in past kernels, would it be possible to bisect?
Comment 25 Eero Tamminen 2019-09-30 08:16:41 UTC
The situation seems to have improved over the weekend.  I no longer see the rest of the tests after HdrBloom & Multithread failing, and even those 2 tests now succeed (with lower FPS), although there are GPU hangs.  Hopefully that was a real improvement and not just random / luck.


(In reply to Francesco Balestrieri from comment #24)
> Given that this used to work in past kernels, would it be possible to bisect?

In comment 12, I bisected it to within a day; the rest is up to the kernel team.  At least the test-case is simple, just running a single program 20x.

(My manager doesn't approve of me spending time on debugging other teams' bugs, but when a test-case is easily reproducible and I have pre-built daily kernels available, like was the case here, I can bisect issues to a regressing day without it taking too much time.)
Comment 26 Francesco Balestrieri 2019-09-30 08:21:29 UTC
Thanks, yes, my question was whether bisecting was possible; I didn't mean that you should do it :)
Comment 27 Eero Tamminen 2019-10-03 09:32:04 UTC
(In reply to Eero Tamminen from comment #12)
> Last night there was a system hang in HdrBloom with Iris on BDW GT2 when
> using latest Mesa and drm-tip kernel git version
...
(In reply to Eero Tamminen from comment #25)
> Situation seems to have improved during weekend.  I don't see anymore rest
> of the tests after HdrBloom & Multithread failing, and even those 2 tests
> now succeed (with lower FPS), although there are GPU hangs.  Hopefully that
> was real improvement and not just random / luck.

It was random.  On SkullCanyon, the Multithread test and the tests following it failed, until the machine hung completely (still showing a Multithread frame on screen), with the latest drm-tip kernel & 3D stack.
Comment 28 Eero Tamminen 2019-10-09 14:45:38 UTC
Created attachment 145685 [details]
ICL-U D1 (recoverable) GPU hang error state

It's been nearly a week with GPU hang recovery working OK.

Same as with bug 111936, drm.debug=0x2 doesn't reveal anything extra about this GPU hang.

Attached is the ICL error state from hour-old drm-tip & Mesa git versions:
-----------------------------------------------------------------
[    7.334758] fuse: init (API version 7.31)
[   98.083248] Iteration 1/3: synmark2 OglHdrBloom
[  106.009380] i915 0000:00:02.0: GPU HANG: ecode 11:1:0x00000000, hang on rcs0
[  106.009383] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  106.009384] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  106.009385] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  106.009386] The GPU crash dump is required to analyze GPU hangs, so please always attach it.
[  106.009387] GPU crash dump saved to /sys/class/drm/card0/error
[  106.009451] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  114.006297] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  122.006304] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  130.008231] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  130.127016] Iteration 2/3: synmark2 OglHdrBloom
[  138.006291] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  146.006294] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  154.007309] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  162.007300] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  162.122292] Iteration 3/3: synmark2 OglHdrBloom
[  170.006292] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  178.006295] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  186.006294] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  194.007291] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
-----------------------------------------------------------------
Comment 29 Eero Tamminen 2019-10-22 08:11:29 UTC
GPU recovery has been working OK for the past weeks, so that part of the issue may have been fixed (for now I have data only from SKL & KBL, not ICL).

On KBL, recoverable GPU hangs now happen with the OglTerrainTess test (bug 111936) instead of HdrBloom.  On SKL, hangs continue with both of these tests.
Comment 30 Eero Tamminen 2019-10-22 08:28:46 UTC
(In reply to Eero Tamminen from comment #29)
> On KBL, recoverable GPU hangs happen now with OglTerrainTess test (bug
> 111936) instead of HDRBloom.  On SKL, hangs continue with both of these
> tests.

Correction: they still happen in HdrBloom, just not on every run anymore, but more like one run out of 5-10 (still reproducible just by running the test in a loop after boot with MESA_LOADER_DRIVER_OVERRIDE=iris).
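
A minimal sketch of that repro loop, assuming the benchmark is started right after boot and Iris is selected only for the benchmark via the environment override mentioned above:
-----------------------------------------------------------------
#!/bin/sh
# Loop the test with the Iris driver forced for the benchmark
# process; the hang now typically appears within 5-10 runs.
for i in $(seq 1 10); do
    MESA_LOADER_DRIVER_OVERRIDE=iris synmark2 OglHdrBloom
    dmesg | grep -q "GPU HANG" && { echo "GPU hang on run $i"; break; }
done
-----------------------------------------------------------------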
Comment 31 Eero Tamminen 2019-10-29 16:08:03 UTC
(In reply to Eero Tamminen from comment #29)
> GPU recovery has been working OK for past weeks,so that part of the issue
> may have been fixed (for now I have data only from SKL & KBL, not ICL).

SKL GT2 & KBL GT3e didn't recover from it last night; all GPU tests after that failed too (I'm missing data since Friday, so I'm not sure when recovery failures started again).
Comment 32 Martin Peres 2019-11-29 19:22:57 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/366.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.