Bug 103556 - [SKL] Unigine Heaven 4.0 fails to GPU hangs
Summary: [SKL] Unigine Heaven 4.0 fails to GPU hangs
Status: REOPENED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-11-03 10:58 UTC by Eero Tamminen
Modified: 2019-08-19 08:26 UTC (History)
CC List: 2 users

See Also:
i915 platform:
i915 features:


Attachments
SKL GT3e error state (first hang) (312.54 KB, text/plain)
2017-11-03 10:58 UTC, Eero Tamminen
SKL GT2 error state (only hang on it so far) (328.31 KB, text/plain)
2017-11-03 10:59 UTC, Eero Tamminen
SKL GT3e error state (latest hang) (303.22 KB, text/plain)
2017-11-03 11:00 UTC, Eero Tamminen
SKL GT3e error state (earlier hang with v4.12 drm-tip kernel) (300.12 KB, text/plain)
2017-11-03 11:16 UTC, Eero Tamminen
VAAPI Youtube Hang (36.17 KB, text/plain)
2017-11-07 07:52 UTC, Rouven Czerwinski (Emantor)
SKL GT4e error state (drm-tip 5.0 kernel) (309.92 KB, text/plain)
2019-03-15 11:42 UTC, Eero Tamminen
i965: align 3DSTATE_TE emission on 3DSTATE_DS/GS (1.76 KB, patch)
2019-03-15 14:24 UTC, Lionel Landwerlin
~/.Heaven/heaven_4.0.cfg (9.31 KB, application/xml)
2019-03-15 17:08 UTC, Eero Tamminen
SKL GT4e error state (drm-tip 5.0 kernel, Mesa 158d45db0c) (308.18 KB, text/plain)
2019-03-18 11:12 UTC, Eero Tamminen
ICL-B4 error state for Valley GPU hang (419.53 KB, text/plain)
2019-03-18 16:55 UTC, Eero Tamminen
BDW GT2 error state (drm-tip 5.1.0-rc1 kernel, Mesa 3c3f2504566) (332.39 KB, text/plain)
2019-03-21 15:25 UTC, Eero Tamminen

Description Eero Tamminen 2017-11-03 10:58:13 UTC
Created attachment 135222 [details]
SKL GT3e error state (first hang)

In last 2 months there have been GPU hangs with Unigine Heaven v4.0 on SKL GT2, GT3e & GT4e:
--------------------------
[ 1328.788039] [drm] GPU HANG: ecode 9:0:0x84df7ec4, in heaven_x64 [2197], reason: Hang on rcs0, action: reset
--------------------------

Heaven settings are: quality = high, filtering = trilinear + 4x anisotropic, no AA, Vsync disabled, Tessellation enabled.

These GPU hangs happen both with the 4.13 drm-tip kernel (from early September) and with the latest drm-tip kernel. With both kernels, the hangs happen with the same Mesa commits.

They don't happen on multiple machines with the same commit, only on a single machine, and even then not on every run, so I think triggering the hang is very timing dependent.

First hang was on SKL GT3e:
2017-10-03 df6b320a83c89b6401fde888375529b3fc66f4fa

Second one on SKL GT4e:
2017-10-04 844ae722c4416420f961ce8a89b5e5278865376c

Only one on SKL GT2:
2017-10-18 f37af5ec8d351fe20e74b05059bea12236220e02

Latest ones happened two days in a row on SKL GT4:
2017-11-01 8d8b9d11c97a679c0954a2f2e7ed8ddcd248ccfa
2017-11-02 a29869e8720b385d3692f6a74de2921412b2c8c1
Comment 1 Eero Tamminen 2017-11-03 10:59:39 UTC
Created attachment 135223 [details]
SKL GT2 error state (only hang on it so far)
Comment 2 Eero Tamminen 2017-11-03 11:00:36 UTC
Created attachment 135224 [details]
SKL GT3e error state (latest hang)
Comment 3 Eero Tamminen 2017-11-03 11:16:38 UTC
Created attachment 135225 [details]
SKL GT3e error state (earlier hang with v4.12 drm-tip kernel)

Correction: there were also a couple of hangs with SKL GT3e & GT4e in September and in August (these were with the v4.12 drm-tip kernel).

Earliest SKL GT3e hang for which I still found data is for:
2017-08-25 1eb58960bfd30d575cca4fa3c600512751aab467
Comment 4 Rouven Czerwinski (Emantor) 2017-11-07 07:52:15 UTC
Created attachment 135278 [details]
VAAPI Youtube Hang

Experienced a hang with the same error message while watching youtube via mpv.
i5-6300U, Kernel 4.13.11, Mesa 17.2.4
Comment 5 Eero Tamminen 2017-11-07 09:19:24 UTC
(In reply to rouven from comment #4)
> Created attachment 135278 [details]
> VAAPI Youtube Hang
> 
> Experienced a hang with the same error message while watching youtube via
> mpv.
> i5-6300U, Kernel 4.13.11, Mesa 17.2.4

Please file a separate bug for that, it's a completely different use-case (video vs. 3D).
Comment 6 Eero Tamminen 2018-01-15 16:11:42 UTC
With Mesa git head and few months old 4.13 drm-tip kernel:
* Last SKL GT2 & GT4e hang is from 2 weeks ago
* Several SKL GT4e hangs in November
* Hangs on SKL GT2 in mid-October & mid-December
* no hangs on SKL GT3e since start of November

With drm-tip kernel and few months old Mesa git:
* On SKL GT2, the only visible hang after start of November is the one in mid-December
* No visible hangs on SKL GT3e / GT4e in past 3 months

With latest Mesa & drm-tip kernel git versions:
* No visible hangs since beginning of November

-> It seems this could be more kernel than Mesa related, and potentially fixed there.

(Or timings have changed so that it doesn't appear with latest git versions.)
Comment 7 Danylo 2018-06-12 14:59:27 UTC
Hello, I've managed to reproduce the issue when running Unigine Heaven 4.0 under Wine in DirectX 11 mode (I'm not sure about your setup; maybe it was native Unigine Heaven with OpenGL, but I failed to reproduce the hang with that). The hang happens in 100% of the runs. I have tested it on drm-tip kernel 4.17 and on 4.15, with latest Mesa and 17.2.8. I have HD Graphics 530 (Skylake GT2).
I've also got an apitrace which leads to the hang, but failed to reduce it or find the issue. Apitrace: https://mega.nz/#!RJMEHTrD!91D34TtyY3OqtNPwanXU8UJ5uqk8g4-2V2wUV0CfE1o. The hang happens in call 256092. It goes away if nothing is painted in call 256092, e.g. when drawing zero triangles.
Comment 8 Danylo 2018-10-31 11:01:12 UTC
Since the issue which hanged Unigine Heaven running under Wine got solved:
https://bugs.freedesktop.org/show_bug.cgi?id=107088

And there are no hangs with OpenGL and no new reports - I would consider this solved.
Comment 9 Eero Tamminen 2019-03-15 11:41:41 UTC
There were still GPU hangs in Heaven at the end of November:
  https://bugs.freedesktop.org/show_bug.cgi?id=108820#c3

(Heaven doesn't use compute, so it's unlikely to relate to bug 108820.)

And with few days old Mesa + drm-git kernel, at least on SKL GT4e, when running Heaven (with tessellation) under XWayland/Weston.  See attached error state.
Comment 10 Eero Tamminen 2019-03-15 11:42:49 UTC
Created attachment 143677 [details]
SKL GT4e error state (drm-tip 5.0 kernel)
Comment 11 Eero Tamminen 2019-03-15 12:08:48 UTC
Hm. Heaven hang dmesg output for the attached error state mentions Weston:
----------------------------------------------------
[  926.608756] i915 0000:00:02.0: GPU HANG: ecode 9:1:0xfffffffe, in  [0], hang on rcs0
[  926.608757] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  926.608758] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  926.608758] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  926.608759] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  926.608759] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  926.609767] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  930.592998] Asynchronous wait on fence i915:weston[644]/1:908a timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
[  934.560031] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  941.851376] Asynchronous wait on fence i915:weston[644]/1:908e timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
[  942.556066] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----------------------------------------------------

Maybe this slightly earlier Heaven hang is related to bug 110131 which I just filed?

(Bug 110131 hangs started after switching from Ubuntu 18.04 desktop with Git built Xserver to Git built libwayland/Weston/XWayland.)
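
For comparing hang reports across kernels, the two "GPU HANG" line variants quoted in this report (the 4.13-era "[drm] GPU HANG: ..." form and the 5.0-era "i915 ...: GPU HANG: ..." form) can be split into fields with a small script. This is only a sketch: the regular expression is derived from the two lines quoted here, not from the kernel source, and the field names are labels chosen for illustration.

```python
import re

# Pattern inferred from the two dmesg lines quoted in this report.
# Group names (ecode, process, pid, engine) are labels made up for this sketch.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fx:]+), "
    r"in (?P<process>\S*) \[(?P<pid>\d+)\], "
    r"(?:reason: )?[Hh]ang on (?P<engine>\w+)"
)


def parse_gpu_hang(line):
    """Return a dict of hang fields, or None if the line is not a hang report."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pid"] = int(fields["pid"])
    return fields
```

Note that the newer line reports an empty process name and PID 0, so the process field cannot be relied on to identify Heaven in that case.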
Comment 12 Lionel Landwerlin 2019-03-15 14:24:59 UTC
Created attachment 143681 [details] [review]
i965: align 3DSTATE_TE emission on 3DSTATE_DS/GS

Thanks for the error states Eero.
Can I bother you with running this simple patch? Might fix it, not sure...
Comment 13 Eero Tamminen 2019-03-15 17:00:25 UTC
I'll run several rounds (20) of Heaven with it and let you know on Monday whether there were any hangs (as the hangs are pretty random, their absence isn't a full guarantee though).
Comment 14 Lionel Landwerlin 2019-03-15 17:00:43 UTC
Unfortunately, I'm not able to reproduce locally.
I'm wondering if we have different versions of Unigine, because I'm using v4.0 too, but I don't have a filtering setting.
Comment 15 Eero Tamminen 2019-03-15 17:08:15 UTC
Created attachment 143682 [details]
~/.Heaven/heaven_4.0.cfg

I think you should get the options from the dialog that opens from buttons at top of the screen when benchmark is running, but couldn't try it right now.  After setting options, Heaven saves them to config file (attached) for further runs.
Comment 16 Eero Tamminen 2019-03-18 11:12:07 UTC
Created attachment 143723 [details]
SKL GT4e error state (drm-tip 5.0 kernel, Mesa 158d45db0c)

During last 4 nights, with 3 runs of Heaven each day:
- with different kernel upstream commits, rest of gfx stack 2 weeks old
- with latest Mesa upstream commits (rest of gfx stack 2 weeks old)
- whole gfx stack using latest upstream commits

Each of these setups had 1 Heaven GPU hang, out of total 12 runs.
=> Mesa TS/GS perf fix not relevant for hangs


(In reply to Lionel Landwerlin from comment #12)
> Created attachment 143681 [details] [review]
> i965: align 3DSTATE_TE emission on 3DSTATE_DS/GS
> 
> Thanks for the error states Eero.
> Can I bother you with running this simple patch? Might fix it, not sure...

21 successive rounds of Heaven went without hangs so it looks good, but I don't know whether there's some pre-condition before the hang happens.
Comment 17 Lionel Landwerlin 2019-03-18 11:26:33 UTC
(In reply to Eero Tamminen from comment #16)
> Created attachment 143723 [details]
> SKL GT4e error state (drm-tip 5.0 kernel, Mesa 158d45db0c)
> 
> During last 4 nights, with 3 runs of Heaven each day:
> - with different kernel upstream commits, rest of gfx stack 2 weeks old
> - with latest Mesa upstream commits (rest of gfx stack 2 weeks old)
> - whole gfx stack using latest upstream commits
> 
> Each of these setups had 1 Heaven GPU hang, out of total 12 runs.
> => Mesa TS/GS perf fix not relevant for hangs
> 
> 
> (In reply to Lionel Landwerlin from comment #12)
> > Created attachment 143681 [details] [review]
> > i965: align 3DSTATE_TE emission on 3DSTATE_DS/GS
> > 
> > Thanks for the error states Eero.
> > Can I bother you with running this simple patch? Might fix it, not sure...
> 
> 21 successive rounds of Heaven went without hangs so it looks good, but I
> don't know whether there's some pre-condition before the hang happens.

Not quite sure how to read your comment.
Do you still get a hang even with that patch?

This last error state is really interesting. The first draw call in there doesn't have 3DSTATE_TE programmed. This patch should have fixed that.
If this happened with the patch, it means something else is wrong.
Comment 18 Eero Tamminen 2019-03-18 11:42:25 UTC
(In reply to Lionel Landwerlin from comment #17)
> Not quite sure how to read your comment.
> Do you still get a hang even with that patch?

No, that was with the normal upstream component testing which doesn't patch anything (any applied patches would get soon stale and fail automated builds), to see whether minor gfx stack differences affect the hangs (and they didn't).


> This last error state is really interesting.

I added it so that you can check whether additional hangs (with latest, unpatched Mesa) are also for the same reason.


> The first draw call in there
> doesn't have 3DSTATE_TE programmed. This patch should have fixed that.
> If this happened with the patch, it means something else is wrong.

With the patch I wasn't able to reproduce hangs (within 21 rounds, whereas unpatched tests got one hang within 12 rounds, so it looks good).
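
As a rough check on how much 21 clean rounds say, the numbers above can be plugged into a simple model. This sketch assumes every run hangs independently with the same probability, estimated as 1 in 12 from the unpatched results; with so few observed hangs that estimate is itself very uncertain, so the output is indicative only.

```python
import math

# Figures taken from the comments above.
unpatched_hangs = 1
unpatched_runs = 12
patched_clean_runs = 21

# Assumption: independent runs with a fixed per-run hang probability.
p_hang = unpatched_hangs / unpatched_runs

# Chance that an unpatched stack would also survive 21 runs by luck.
p_all_clean = (1 - p_hang) ** patched_clean_runs

# Clean runs needed before luck alone explains them less than 5% of the time.
runs_for_95 = math.ceil(math.log(0.05) / math.log(1 - p_hang))

print(f"chance of {patched_clean_runs} clean runs without a fix: {p_all_clean:.1%}")
print(f"clean runs needed for 95% confidence: {runs_for_95}")
```

Under these assumptions an unfixed stack would still pass 21 rounds roughly 16% of the time, and about 35 clean rounds would be needed for 95% confidence, which matches the caution that missing hangs are not a full guarantee.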
Comment 19 Eero Tamminen 2019-03-18 11:44:00 UTC
Can you see from the earlier attached error states whether the last year SKL GT3e Heaven hangs were for the same reason?
Comment 20 Lionel Landwerlin 2019-03-18 11:46:26 UTC
(In reply to Eero Tamminen from comment #19)
> Can you see from the earlier attached error states whether the last year SKL
> GT3e Heaven hangs were for the same reason?

Yeah, the first error state on GT3e has the same characteristic: it enables the HS/DS stages but leaves TE disabled.
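
The characteristic described here can be expressed as a simple consistency check. The sketch below is a hypothetical model, not Mesa or error-decoder code: it assumes the three tessellation stages (HS, TE, DS) are meant to be enabled or disabled together, and the function name and boolean inputs are invented for illustration.

```python
def tess_state_problem(hs_enabled, te_enabled, ds_enabled):
    """Return a description of an inconsistent tessellation setup, or None.

    Assumption for this sketch: HS, TE and DS must all be on or all be off.
    """
    stages = {"HS": hs_enabled, "TE": te_enabled, "DS": ds_enabled}
    on = [name for name, enabled in stages.items() if enabled]
    off = [name for name, enabled in stages.items() if not enabled]
    if on and off:
        return f"{'/'.join(on)} enabled but {'/'.join(off)} disabled"
    return None


# The state reported for the hanging draw call: HS and DS on, TE off.
print(tess_state_problem(True, False, True))
```

A check like this only flags the mismatch; it says nothing about why the state was emitted in that order.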
Comment 21 Lionel Landwerlin 2019-03-18 11:46:46 UTC
(In reply to Eero Tamminen from comment #18)
> (In reply to Lionel Landwerlin from comment #17)
> > Not quite sure how to read your comment.
> > Do you still get a hang even with that patch?
> 
> No, that was with the normal upstream component testing which doesn't patch
> anything (any applied patches would get soon stale and fail automated
> builds), to see whether minor gfx stack differences affect the hangs (and
> they didn't).
> 
> 
> > This last error state is really interesting.
> 
> I added it so that you can check whether additional hangs (with latest,
> unpatched Mesa) are also for the same reason.
> 
> 
> > The first draw call in there
> > doesn't have 3DSTATE_TE programmed. This patch should have fixed that.
> > If this happened with the patch, it means something else is wrong.
> 
> With the patch I wasn't able to reproduce hangs (within 21 rounds, whereas
> unpatched tests got one hang within 12 rounds, so it looks good).

Thanks a bunch, I'll submit upstream for review.
Comment 22 Eero Tamminen 2019-03-18 15:23:06 UTC
FYI: I'm seeing recoverable GPU hangs also on ICL B4, both in Heaven & Valley (and SynMark CSDof).  Do you want error state for that too?
Comment 23 Lionel Landwerlin 2019-03-18 15:30:01 UTC
(In reply to Eero Tamminen from comment #22)
> FYI: I'm seeing recoverable GPU hangs also on ICL B4, both in Heaven &
> Valley (and SynMark CSDof).  Do you want error state for that too?

If this is without the patch attached here, I'm fairly confident this is the same problem.
The broken logic applies to all generations supporting tessellation.

If you have it already, just attach it and I'll look at it.
Comment 24 Eero Tamminen 2019-03-18 16:55:32 UTC
Created attachment 143728 [details]
ICL-B4 error state for Valley GPU hang
Comment 25 Lionel Landwerlin 2019-03-18 17:09:13 UTC
(In reply to Eero Tamminen from comment #24)
> Created attachment 143728 [details]
> ICL-B4 error state for Valley GPU hang

Hmm.. this one doesn't have any tessellation enabled :/
So it's probably a different bug :(
Comment 26 Eero Tamminen 2019-03-19 08:50:00 UTC
(In reply to Lionel Landwerlin from comment #25)
> (In reply to Eero Tamminen from comment #24)
> > Created attachment 143728 [details]
> > ICL-B4 error state for Valley GPU hang
> 
> Hmm.. this one doesn't have any tessellation enabled :/
> So it's probably a different bug :(

Ok.  Valley hanged before Heaven, so I didn't get one from Heaven.

What about error state in tessellation hang bug 110131, is it due to same issue as Heaven hang i.e. duplicate of this bug?
Comment 27 Eero Tamminen 2019-03-21 15:25:40 UTC
Created attachment 143748 [details]
BDW GT2 error state (drm-tip 5.1.0-rc1 kernel, Mesa 3c3f2504566)

Saw recoverable Heaven hang also on BDW GT2, so it's possible that this bug isn't SKL specific.
Comment 28 Eero Tamminen 2019-03-21 16:41:44 UTC
(In reply to Eero Tamminen from comment #27)
> Saw recoverable Heaven hang also on BDW GT2, so it's possible that this bug
> isn't SKL specific.

According to Lionel, it indeed isn't SKL specific -> removing SKL prefix.
Comment 29 Lionel Landwerlin 2019-03-26 12:09:55 UTC
(In reply to Danylo from comment #7)
> Hello, I've managed to reproduce the issue when running Unigine Heaven 4.0
> under Wine in directx11 mode (I'm not sure about your setup, maybe it was
> native Unigine Heaven with opengl, but I failed to reproduce hang in it).
> The hang happens in 100% of the runs. I have tested it on drm-tip kernel
> 4.17 and on 4.15 with latest Mesa and 17.2.8. I have HD Graphics 530
> (Skylake GT2).
> I've also got an api trace which leads to the hang but failed to reduce it
> or find the issue. Apitrace:
> https://mega.nz/#!RJMEHTrD!91D34TtyY3OqtNPwanXU8UJ5uqk8g4-2V2wUV0CfE1o. Hang
> happens in call 256092. The hang will be gone if nothing is painted in
> 256092 call e.g. draw zero triangles.

I couldn't reproduce this hang on master and tracked the fix down to:

commit eca4a6548d07bbbb02a7768edb397bad7b72cfc2
Author: Danylo Piliaiev <danylo.piliaiev@gmail.com>
Date:   Mon Jul 2 17:04:23 2018 +0300

    i965: Disable dual source blending when shader doesn't support it on gen8+
    
    Dual source blending behaviour is undefined when shader doesn't
    have second color output, dismissing fragment in such situation
    leads to a hang on gen8+ if depth test in enabled.
    
    Since blending cannot be gracefully fixed in such case and the result
    is undefined - blending is simply disabled.
    
    v2 (Kenneth Graunke):
     - Listen to BRW_NEW_FS_PROG_DATA in 3DSTATE_PS_BLEND
     - Also whack BLEND_STATE[] to keep the two in sync, since we're not
       sure exactly which copy of the redundant info the hardware will use.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107088
    Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

Anyway, it's clearly a different issue as this trace doesn't have any tessellation enabled.
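
The decision described in the quoted commit message can be modelled in a few lines. This is a hypothetical illustration of the rule only, not the i965 implementation: the function and parameter names are invented, and the set below simply lists the OpenGL blend factors that read the second color output.

```python
# OpenGL blend factors that read the fragment shader's second color output.
DUAL_SRC_FACTORS = {
    "GL_SRC1_COLOR",
    "GL_ONE_MINUS_SRC1_COLOR",
    "GL_SRC1_ALPHA",
    "GL_ONE_MINUS_SRC1_ALPHA",
}


def effective_blend_enable(blend_enabled, factors, shader_has_dual_src_output):
    """Model of the rule: drop blending when dual-source factors are
    requested but the shader writes no second color output."""
    uses_dual_src = any(factor in DUAL_SRC_FACTORS for factor in factors)
    if blend_enabled and uses_dual_src and not shader_has_dual_src_output:
        return False  # result is undefined anyway, so blending is disabled
    return blend_enabled


# Dual-source factor requested, but the shader has a single color output.
print(effective_blend_enable(True, ["GL_SRC1_ALPHA", "GL_ONE"], False))
```

The commit message also notes that the real fix keeps 3DSTATE_PS_BLEND and BLEND_STATE in sync, which this model does not attempt to represent.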
Comment 30 Eero Tamminen 2019-03-26 14:03:39 UTC
The hang issues I've filed (this one and the compute one) are too rare/random to be bisectable.

In last few days, I got one hang in Heaven on SKL GT2 two days ago, one hang on SKL GT3e yesterday (and one hang in CarChase on SKL GT4e yesterday).  This is from doing 3 runs, few times on different gfx stack setups every day, on each of the machines.

On other machines than SKL, they're much rarer (attached BDW hang is the only one I've noticed).
Comment 31 Lionel Landwerlin 2019-03-26 14:09:26 UTC
(In reply to Eero Tamminen from comment #30)
> The hang issues I've filed (this one and the compute one) are too
> rare/random to be bisectable.
> 
> In last few days, I got one hang in Heaven on SKL GT2 two days ago, one hang
> on SKL GT3e yesterday (and one hang in CarChase on SKL GT4e yesterday). 
> This is from doing 3 runs, few times on different gfx stack setups every
> day, on each of the machines.
> 
> On other machines than SKL, they're much rarer (attached BDW hang is the
> only one I've noticed).

I'm starting to think we might have a problem with the tracking of the aperture available.
By default it seems to be set to 4GB, but that doesn't make sense to me on gen8+, I would expect 2^48 (48 bits of address space).
Reducing that number to something like 200MB in i965 triggers all kinds of random hangs with Heaven.

4GB might be big enough to only trigger issues after a long time.
It would explain the difficulty to reproduce.

Anyway that's my current track, I'll try to understand this better with Ken.
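
To put the two sizes mentioned above side by side, here is the arithmetic as a short sketch. It assumes the reported default means 4 GiB (2^32 bytes); the variable names are made up for illustration and do not correspond to any i965 field.

```python
GIB = 2 ** 30
TIB = 2 ** 40

# Assumption: the observed default is 4 GiB.
aperture_default = 4 * GIB
# A full 48-bit address space, as expected on gen8+ in the comment above.
address_space_48bit = 2 ** 48

print(f"default:  {aperture_default // GIB} GiB")
print(f"48-bit:   {address_space_48bit // TIB} TiB")
print(f"ratio:    {address_space_48bit // aperture_default}x")
```

So a 48-bit space is 256 TiB, 65536 times the 4 GiB default.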
Comment 32 Eero Tamminen 2019-08-19 08:26:50 UTC
Started testing again a few days ago with SKL GT4e, and again got a recoverable Heaven hang, with the drm-tip git kernel v5.2 and Mesa git, so these are still happening.

