Created attachment 135222
SKL GT3e error state (first hang)

In the last 2 months there have been GPU hangs with Unigine Heaven v4.0 on SKL GT2, GT3e & GT4e:
--------------------------
[ 1328.788039] [drm] GPU HANG: ecode 9:0:0x84df7ec4, in heaven_x64 [2197], reason: Hang on rcs0, action: reset
--------------------------

Heaven settings are: quality = high, filtering = trilinear + 4x anisotropic, no AA, Vsync disabled, Tessellation enabled.

These GPU hangs happen both with a 4.13 drm-tip kernel (from early September) and with the latest drm-tip kernel; with both kernels, the hangs happen with the same Mesa commits. For a given commit, they don't happen on multiple machines, only on a single one, and even there not on every run, so I think triggering the hang is very timing-related.

First hang was on SKL GT3e:
2017-10-03 df6b320a83c89b6401fde888375529b3fc66f4fa
Second one on SKL GT4e:
2017-10-04 844ae722c4416420f961ce8a89b5e5278865376c
Only one on SKL GT2:
2017-10-18 f37af5ec8d351fe20e74b05059bea12236220e02
Latest ones happened two days in a row on SKL GT4:
2017-11-01 8d8b9d11c97a679c0954a2f2e7ed8ddcd248ccfa
2017-11-02 a29869e8720b385d3692f6a74de2921412b2c8c1
Created attachment 135223
SKL GT2 error state (only hang on it so far)
Created attachment 135224
SKL GT3e error state (latest hang)
Created attachment 135225
SKL GT3e error state (earlier hang with v4.12 drm-tip kernel)

Correction: there were a couple of hangs with SKL GT3e & GT4e also in September and August (these were with the v4.12 drm-tip kernel). The earliest SKL GT3e hang for which I could still find data is for:
2017-08-25 1eb58960bfd30d575cca4fa3c600512751aab467
Created attachment 135278
VAAPI Youtube Hang

Experienced a hang with the same error message while watching YouTube via mpv.
i5-6300U, Kernel 4.13.11, Mesa 17.2.4
(In reply to rouven from comment #4)
> Created attachment 135278
> VAAPI Youtube Hang
>
> Experienced a hang with the same error message while watching YouTube via
> mpv.
> i5-6300U, Kernel 4.13.11, Mesa 17.2.4

Please file a separate bug for that; it's a completely different use case (video vs. 3D).
With Mesa git head and a few-months-old 4.13 drm-tip kernel:
* The last SKL GT2 & GT4e hang is from 2 weeks ago
* Several SKL GT4e hangs in November
* Hangs on SKL GT2 in mid-October & mid-December
* No hangs on SKL GT3e since the start of November

With the drm-tip kernel and a few-months-old Mesa git:
* On SKL GT2, the only visible hang after the start of November is in mid-December
* No visible hangs on SKL GT3e / GT4e in the past 3 months

With the latest Mesa & drm-tip kernel git versions:
* No visible hangs since the beginning of November

-> It seems this could be more kernel-related than Mesa-related, and potentially fixed there. (Or timings have changed so that it doesn't appear with the latest git versions.)
Hello, I've managed to reproduce the issue when running Unigine Heaven 4.0 under Wine in DirectX 11 mode (I'm not sure about your setup; maybe it was native Unigine Heaven with OpenGL, but I failed to reproduce the hang with it). The hang happens in 100% of the runs. I have tested it on drm-tip kernel 4.17 and on 4.15, with the latest Mesa and with 17.2.8. I have HD Graphics 530 (Skylake GT2).
I've also got an API trace which leads to the hang, but I failed to reduce it or find the issue. Apitrace:
https://mega.nz/#!RJMEHTrD!91D34TtyY3OqtNPwanXU8UJ5uqk8g4-2V2wUV0CfE1o
The hang happens in call 256092. The hang goes away if nothing is drawn in call 256092, e.g. when drawing zero triangles.
Since the issue which hung Unigine Heaven running under Wine got solved:
https://bugs.freedesktop.org/show_bug.cgi?id=107088
and there are no hangs with OpenGL and no new reports, I would consider this solved.
There were still GPU hangs in Heaven at the end of November:
https://bugs.freedesktop.org/show_bug.cgi?id=108820#c3
(Heaven doesn't use compute, so it's unlikely to be related to bug 108820.)

They also happen with a few-days-old Mesa + drm-git kernel, at least on SKL GT4e, when running Heaven (with tessellation) under XWayland/Weston. See the attached error state.
Created attachment 143677
SKL GT4e error state (drm-tip 5.0 kernel)
Hm. The Heaven hang dmesg output for the attached error state mentions Weston:
----------------------------------------------------
[  926.608756] i915 0000:00:02.0: GPU HANG: ecode 9:1:0xfffffffe, in  [0], hang on rcs0
[  926.608757] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  926.608758] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  926.608758] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  926.608759] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  926.608759] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  926.609767] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  930.592998] Asynchronous wait on fence i915:weston[644]/1:908a timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
[  934.560031] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[  941.851376] Asynchronous wait on fence i915:weston[644]/1:908e timed out (hint:intel_atomic_commit_ready+0x0/0x54 [i915])
[  942.556066] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
----------------------------------------------------

Maybe this slightly earlier Heaven hang is related to bug 110131, which I just filed? (Bug 110131 hangs started after switching from the Ubuntu 18.04 desktop with a Git-built Xserver to Git-built libwayland/Weston/XWayland.)
Created attachment 143681
i965: align 3DSTATE_TE emission on 3DSTATE_DS/GS

Thanks for the error states, Eero.
Can I bother you with running this simple patch? It might fix it, not sure...
I'll run several rounds (20) of Heaven with it and let you know on Monday whether there were any hangs (as the hangs are pretty random, their absence isn't a full guarantee, though).
Unfortunately, I'm not able to reproduce it locally. I'm wondering if we have different versions of Unigine, because I'm using v4.0 too, but I don't have a filtering setting.
Created attachment 143682
~/.Heaven/heaven_4.0.cfg

I think you should get the options from the dialog that opens from the buttons at the top of the screen while the benchmark is running, but I couldn't try it right now. After you set the options, Heaven saves them to a config file (attached) for further runs.
Created attachment 143723
SKL GT4e error state (drm-tip 5.0 kernel, Mesa 158d45db0c)

During the last 4 nights, with 3 runs of Heaven each day:
- with different kernel upstream commits, rest of the gfx stack 2 weeks old
- with the latest Mesa upstream commits (rest of the gfx stack 2 weeks old)
- with the whole gfx stack using the latest upstream commits

Each of these setups had 1 Heaven GPU hang out of 12 runs in total.
=> The Mesa TS/GS perf fix is not relevant to the hangs.


(In reply to Lionel Landwerlin from comment #12)
> Created attachment 143681
> i965: align 3DSTATE_TE emission on 3DSTATE_DS/GS
>
> Thanks for the error states, Eero.
> Can I bother you with running this simple patch? It might fix it, not sure...

21 successive rounds of Heaven went without hangs, so it looks good, but I don't know whether there's some precondition before the hang happens.
(In reply to Eero Tamminen from comment #16)
> Created attachment 143723
> SKL GT4e error state (drm-tip 5.0 kernel, Mesa 158d45db0c)
>
> During the last 4 nights, with 3 runs of Heaven each day:
> - with different kernel upstream commits, rest of the gfx stack 2 weeks old
> - with the latest Mesa upstream commits (rest of the gfx stack 2 weeks old)
> - with the whole gfx stack using the latest upstream commits
>
> Each of these setups had 1 Heaven GPU hang out of 12 runs in total.
> => The Mesa TS/GS perf fix is not relevant to the hangs.
>
>
> (In reply to Lionel Landwerlin from comment #12)
> > Created attachment 143681
> > i965: align 3DSTATE_TE emission on 3DSTATE_DS/GS
> >
> > Thanks for the error states, Eero.
> > Can I bother you with running this simple patch? It might fix it, not sure...
>
> 21 successive rounds of Heaven went without hangs, so it looks good, but I
> don't know whether there's some precondition before the hang happens.

Not quite sure how to read your comment.
Do you still get a hang even with that patch?

This last error state is really interesting. The first draw call in there doesn't have 3DSTATE_TE programmed. This patch should have fixed that.
If this happened with the patch, it means something else is wrong.
(In reply to Lionel Landwerlin from comment #17)
> Not quite sure how to read your comment.
> Do you still get a hang even with that patch?

No, that was with the normal upstream component testing, which doesn't patch anything (any applied patches would soon get stale and fail automated builds), to see whether minor gfx stack differences affect the hangs (and they didn't).

> This last error state is really interesting.

I added it so that you can check whether additional hangs (with the latest, unpatched Mesa) have the same cause.

> The first draw call in there
> doesn't have 3DSTATE_TE programmed. This patch should have fixed that.
> If this happened with the patch, it means something else is wrong.

With the patch, I wasn't able to reproduce the hang (in 21 rounds, whereas unpatched tests got one hang in 12 rounds), so it looks good.
Can you see from the earlier attached error states whether last year's SKL GT3e Heaven hangs had the same cause?
(In reply to Eero Tamminen from comment #19)
> Can you see from the earlier attached error states whether last year's SKL
> GT3e Heaven hangs had the same cause?

Yeah, the first error state on GT3e has the same characteristic: it enables the HS/DS stages but leaves TE disabled.
(In reply to Eero Tamminen from comment #18)
> (In reply to Lionel Landwerlin from comment #17)
> > Not quite sure how to read your comment.
> > Do you still get a hang even with that patch?
>
> No, that was with the normal upstream component testing, which doesn't patch
> anything (any applied patches would soon get stale and fail automated
> builds), to see whether minor gfx stack differences affect the hangs (and
> they didn't).
>
>
> > This last error state is really interesting.
>
> I added it so that you can check whether additional hangs (with the latest,
> unpatched Mesa) have the same cause.
>
>
> > The first draw call in there
> > doesn't have 3DSTATE_TE programmed. This patch should have fixed that.
> > If this happened with the patch, it means something else is wrong.
>
> With the patch, I wasn't able to reproduce the hang (in 21 rounds, whereas
> unpatched tests got one hang in 12 rounds), so it looks good.

Thanks a bunch, I'll submit it upstream for review.
FYI: I'm seeing recoverable GPU hangs also on ICL B4, both in Heaven & Valley (and SynMark CSDof). Do you want an error state for that too?
(In reply to Eero Tamminen from comment #22)
> FYI: I'm seeing recoverable GPU hangs also on ICL B4, both in Heaven &
> Valley (and SynMark CSDof). Do you want an error state for that too?

If this is without the patch attached here, I'm fairly confident this is the same problem. The broken logic applies to all generations supporting tessellation.
If you have it already, just attach it and I'll look at it.
Created attachment 143728
ICL-B4 error state for Valley GPU hang
(In reply to Eero Tamminen from comment #24)
> Created attachment 143728
> ICL-B4 error state for Valley GPU hang

Hmm... this one doesn't have tessellation enabled :/
So it's probably a different bug :(
(In reply to Lionel Landwerlin from comment #25)
> (In reply to Eero Tamminen from comment #24)
> > Created attachment 143728
> > ICL-B4 error state for Valley GPU hang
>
> Hmm... this one doesn't have tessellation enabled :/
> So it's probably a different bug :(

OK. Valley hung before Heaven, so I didn't get one from Heaven.

What about the error state in tessellation hang bug 110131: is it due to the same issue as the Heaven hang, i.e. a duplicate of this bug?
Created attachment 143748
BDW GT2 error state (drm-tip 5.1.0-rc1 kernel, Mesa 3c3f2504566)

Saw a recoverable Heaven hang on BDW GT2 too, so it's possible that this bug isn't SKL-specific.
(In reply to Eero Tamminen from comment #27)
> Saw a recoverable Heaven hang on BDW GT2 too, so it's possible that this bug
> isn't SKL-specific.

According to Lionel, it isn't SKL-specific -> removing the SKL prefix.
(In reply to Danylo from comment #7)
> Hello, I've managed to reproduce the issue when running Unigine Heaven 4.0
> under Wine in DirectX 11 mode (I'm not sure about your setup; maybe it was
> native Unigine Heaven with OpenGL, but I failed to reproduce the hang with it).
> The hang happens in 100% of the runs. I have tested it on drm-tip kernel
> 4.17 and on 4.15, with the latest Mesa and with 17.2.8. I have HD Graphics 530
> (Skylake GT2).
> I've also got an API trace which leads to the hang, but I failed to reduce it
> or find the issue. Apitrace:
> https://mega.nz/#!RJMEHTrD!91D34TtyY3OqtNPwanXU8UJ5uqk8g4-2V2wUV0CfE1o
> The hang happens in call 256092. The hang goes away if nothing is drawn in
> call 256092, e.g. when drawing zero triangles.

I couldn't reproduce this hang on master and tracked the fix down to:

commit eca4a6548d07bbbb02a7768edb397bad7b72cfc2
Author: Danylo Piliaiev <danylo.piliaiev@gmail.com>
Date:   Mon Jul 2 17:04:23 2018 +0300

    i965: Disable dual source blending when shader doesn't support it on gen8+

    Dual source blending behaviour is undefined when shader doesn't
    have second color output, dismissing fragment in such situation
    leads to a hang on gen8+ if depth test in enabled.

    Since blending cannot be gracefully fixed in such case and the result
    is undefined - blending is simply disabled.

    v2 (Kenneth Graunke):
     - Listen to BRW_NEW_FS_PROG_DATA in 3DSTATE_PS_BLEND
     - Also whack BLEND_STATE[] to keep the two in sync, since we're
       not sure exactly which copy of the redundant info the hardware
       will use.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=107088
    Signed-off-by: Danylo Piliaiev <danylo.piliaiev@globallogic.com>
    Reviewed-by: Jason Ekstrand <jason@jlekstrand.net>
    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>

Anyway, it's clearly a different issue, as this trace doesn't have any tessellation enabled.
The hang issues I've filed (this one and the compute one) are too rare/random to be bisectable.

In the last few days, I got one hang in Heaven on SKL GT2 two days ago and one on SKL GT3e yesterday (and one hang in CarChase on SKL GT4e yesterday). This is from doing 3 runs, a few times on different gfx stack setups, every day on each of the machines.

On machines other than SKL, they're much rarer (the attached BDW hang is the only one I've noticed).
(In reply to Eero Tamminen from comment #30)
> The hang issues I've filed (this one and the compute one) are too rare/random
> to be bisectable.
>
> In the last few days, I got one hang in Heaven on SKL GT2 two days ago and one
> on SKL GT3e yesterday (and one hang in CarChase on SKL GT4e yesterday).
> This is from doing 3 runs, a few times on different gfx stack setups, every
> day on each of the machines.
>
> On machines other than SKL, they're much rarer (the attached BDW hang is the
> only one I've noticed).

I'm starting to think we might have a problem with the tracking of the available aperture. By default it seems to be set to 4GB, but that doesn't make sense to me on gen8+; I would expect 2^48 (48 bits of address space).
Reducing that number to something like 200MB in i965 triggers all kinds of random hangs with Heaven. 4GB might be big enough to only trigger issues after a long time, which would explain the difficulty in reproducing.

Anyway, that's my current lead; I'll try to understand this better with Ken.
I started testing with SKL GT4e again a few days ago and got another recoverable Heaven hang, with drm-tip git kernel v5.2 and Mesa git, so these are still happening.
Created attachment 145466
KBL GT3e recoverable GPU hang in Unigine Valley 1.0

Seeing a GPU hang in Unigine Valley on KBL GT3e too.
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new issue via this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1644