Bug 92234 - [BDW] GPU hang in Shogun2
Summary: [BDW] GPU hang in Shogun2
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Matt Turner
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords: bisected, regression
Depends on:
Blocks: 93185
Reported: 2015-10-02 07:54 UTC by Pavel Ondračka
Modified: 2016-12-13 00:00 UTC

See Also:
i915 platform:
i915 features:


Attachments
Shaders output with and without hangs (345.70 KB, text/plain)
2015-12-08 14:06 UTC, pavel.e.popov

Description Pavel Ondračka 2015-10-02 07:54:52 UTC
Originally posted in bug 91730, sorry about that...

When starting Total War: Shogun2, the GPU hangs, the program exits, and "intel_do_flush_locked failed: Input/output error" is printed in the terminal.

I've managed to bisect it to:
commit f5cf74d8ba8ce30b9d53b2198e5122ed72f1dcff
Author: Matt Turner <mattst88@gmail.com>
Date:   Tue May 5 20:25:07 2015 -0700

    nir: Recognize (a < c || b < c) as min(a, b) < c.
    
    ... and (a >= c) || (b >= c) as max(a, b) >= c.
    
    Similar to commit 97e6c1b9.
    
    total instructions in shared programs: 6182276 -> 6182180 (-0.00%)
    instructions in affected programs:     6400 -> 6304 (-1.50%)
    helped:                                68
    HURT:                                  4
    
    Reviewed-by: Jordan Justen <jordan.l.justen@intel.com>
    Reviewed-by: Glenn Kennard <glenn.kennard@gmail.com>
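
For reference, the identity behind the bisected commit can be checked with a minimal sketch. This is only a brute-force check over small integers in plain Python, not Mesa code; the function names are hypothetical, and floating-point special cases such as NaN are deliberately not covered.

```python
import itertools

# Hypothetical helper names; each models one side of the rewrite
# described in the commit message.

def lt_original(a, b, c):
    return a < c or b < c

def lt_rewritten(a, b, c):
    return min(a, b) < c

def ge_original(a, b, c):
    return a >= c or b >= c

def ge_rewritten(a, b, c):
    return max(a, b) >= c

# Exhaustively compare both forms over a small integer range.
values = range(-3, 4)
mismatches = [
    (a, b, c)
    for a, b, c in itertools.product(values, repeat=3)
    if lt_original(a, b, c) != lt_rewritten(a, b, c)
    or ge_original(a, b, c) != ge_rewritten(a, b, c)
]

print(len(mismatches))  # prints 0
```

Since the rewrite holds for these inputs, this sketch gives no reason to think the transformation itself is wrong; it says nothing about how the resulting code is handled later in the compiler.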

It can be reproduced with this trace: http://pavel.ondracka.cz/Shogun2.trace

My system info:
GPU: Mesa DRI Intel(R) HD Graphics 5500 (Broadwell GT2)
Mesa: c0722be9f58ef89dae98d8c459ec4f9589f97748
kernel: 4.1.7-200.fc22.x86_64
libdrm: 94ecdcb8b11dd3eb6b047ad72030d775014aadee
xf86-video-intel: 679ee12079a7d2682d41506b81973c7c7d4fa1d8 (sna+dri3)
Comment 1 pavel.e.popov 2015-10-02 10:33:26 UTC
Thanks for this bug, Pavel!

I also observed GPU hangs playing traces recorded for similar games:
“Total War: Empire” and “Total War: Napoleon”.

I observed this issue only on BDW with Mesa 10.6.
No GPU hangs were observed on BDW with Mesa 10.4 (it doesn't have NIR).
No GPU hangs were observed on HSW with Mesa 10.4 and Mesa 10.6.

I tried to revert commit f5cf74d8ba8ce30b9d53b2198e5122ed72f1dcff but without success.
Comment 2 pavel.e.popov 2015-10-05 04:28:23 UTC
As I said previously, I see hangs on “Total War: Napoleon” and “Total War: Empire” traces on BDW with Mesa 10.6. I tried to revert commit "nir: Recognize (a < c || b < c) as min(a, b) < c." but without success.

So I tried removing all optimizations in nir_opt_algebraic.py:
no hangs were observed on “Total War: Napoleon”
hangs were still observed on “Total War: Empire”.

Somehow NIR optimizations from nir_opt_algebraic.py lead to hangs in “Total War: Napoleon”. It looks like Pavel Ondračka observed a similar issue with "Total War: Shogun2".
Comment 3 pavel.e.popov 2015-10-05 08:55:47 UTC
Could reproduce hangs on BDW using the trace "Total War: Shogun2":
    http://pavel.ondracka.cz/Shogun2.trace

Found that NO hangs were observed on this trace with Intel NIR disabled:
    export INTEL_USE_NIR=0

However, this approach didn't work with my traces "Total War: Napoleon" and "Total War: Empire" (unfortunately, I can't share these traces, and they are not apitrace traces). Most likely it's not a NIR issue.
Comment 4 pavel.e.popov 2015-10-07 02:38:02 UTC
Found that hangs on "Total War: Napoleon" and "Total War: Empire" are gone without this commit (nir/opt_algebraic: Add some constant bcsel reductions):
    http://cgit.freedesktop.org/mesa/mesa/commit/?id=604ae33c8b95a97ba586780324566fd21c59b695

Observed that these additional optimizations somehow lead to hangs:
+# Add optimizations to handle the case where the result of a ternary is
+# compared to a constant.  This way we can take things like
+#
+# (a ? 0 : 1) > 0
+#
+# and turn it into
+#
+# a ? (0 > 0) : (1 > 0)
+#
+# which constant folding will eat for lunch.  The resulting ternary will
+# further get cleaned up by the boolean reductions above and we will be
+# left with just the original variable "a".
+for op in ['flt', 'fge', 'feq', 'fne',
+           'ilt', 'ige', 'ieq', 'ine', 'ult', 'uge']:
+   optimizations += [
+      ((op, ('bcsel', 'a', '#b', '#c'), '#d'),
+       ('bcsel', 'a', (op, 'b', 'd'), (op, 'c', 'd'))),
+      ((op, '#d', ('bcsel', a, '#b', '#c')),
+       ('bcsel', 'a', (op, 'd', 'b'), (op, 'd', 'c'))),
+   ]
+
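
To illustrate what the quoted rule does, here is a minimal sketch in plain Python. It is a model only, not NIR: `bcsel` is a hypothetical stand-in for the select operation, `<` stands in for one of the comparison ops, and b, c, d play the role of the constant operands ('#b', '#c', '#d').

```python
import itertools

def bcsel(cond, if_true, if_false):
    """Model of a select: if_true when cond holds, else if_false."""
    return if_true if cond else if_false

def before(a, b, c, d):
    # Pattern side: compare the result of the select against d.
    return bcsel(a, b, c) < d

def after(a, b, c, d):
    # Replacement side: push the comparison into both arms, so each
    # arm compares two constants and can be constant-folded.
    return bcsel(a, b < d, c < d)

# Check that both sides agree for every condition and small constants.
consts = range(-2, 3)
disagreements = [
    (a, b, c, d)
    for a in (False, True)
    for b, c, d in itertools.product(consts, repeat=3)
    if before(a, b, c, d) != after(a, b, c, d)
]
print(len(disagreements))  # prints 0

# The example from the quoted comment: (a ? 0 : 1) > 0 becomes
# a ? (0 > 0) : (1 > 0), which in this model folds to "not a".
for a in (False, True):
    assert (bcsel(a, 0, 1) > 0) == bcsel(a, 0 > 0, 1 > 0) == (not a)
```

In this model both sides always agree, which is consistent with the later finding in this bug that the rewrite changed which code reached the backend rather than being wrong in itself.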

Also made sure that hangs on "Total War: Shogun2" are gone without this commit (nir: Recognize (a < c || b < c) as min(a, b) < c.)
    http://cgit.freedesktop.org/mesa/mesa/commit/?id=f5cf74d8ba8ce30b9d53b2198e5122ed72f1dcff
Comment 5 Kenneth Graunke 2015-12-03 22:57:08 UTC
I wasn't able to reproduce this on my Broadwell GT2.  I tried both Shogun2.trace from this bug and "Empire: Total War".  I used Mesa master from today, and also tried 604ae33c8b95a97ba586780324566fd21c59b695.  Both worked OK for me - no hangs.
Comment 6 Kenneth Graunke 2015-12-03 22:57:24 UTC
Is this still happening for either of you?
Comment 7 pavel.e.popov 2015-12-04 12:14:10 UTC
We still have these GPU hangs in our environment. 

I used some common configuration to make sure that it's not an issue on our side. I tried Shogun2.trace and could reproduce GPU hang.

OS: Ubuntu 15.04
Kernel: 3.19.0-22-generic
GPU: Mesa DRI Intel(R) Iris Pro P6300 (Broadwell GT3e)
Mesa: 11.0.4~git20151026+11.0.ec14e6f8-0ubuntu0ricotz~vivid

Kenneth, could you share your configuration?
Comment 8 Kenneth Graunke 2015-12-05 08:57:32 UTC
I'm using kernel 4.4.0-rc2 on a BDW GT2 (Lenovo X250) with Mesa 11.2.0-devel (git-cf97544).
Comment 9 pavel.e.popov 2015-12-07 13:06:49 UTC
Somehow this issue wasn't reproduced with Mesa master. But I'm not sure that this problem will not appear again.

I used 41e82f4f96f87e3b5bd3e7a3dc221cf6e6b6ae0b from Mesa master and couldn't reproduce hangs on all Total Wars.

Kenneth, I didn't test Mesa at 604ae33c8b95a97ba586780324566fd21c59b695 as you did. I observed these hangs on Mesa 10.6 and found they are gone on my workloads if just the one patch 604ae33c8b95a97ba586780324566fd21c59b695 is reverted. This may be the reason why you didn't see hangs.

Looks like some ordering of NIR optimizations leads to hangs on BDW. For example, on Mesa 10.6 I reverted patch 604ae33c8b95a97ba586780324566fd21c59b695 and the hangs were gone on my workloads Empire and Napoleon, but when I also reverted patch f5cf74d8ba8ce30b9d53b2198e5122ed72f1dcff for Shogun2, the hangs appeared again on Empire. I could hide all hangs only when I removed all NIR optimizations in nir_opt_algebraic.py (comment 2 is wrong, some optimizations weren't removed during that experiment).
Comment 10 Ian Romanick 2015-12-07 22:48:44 UTC
There's a lot of information in this bug, and I want to make sure I understand it all.

Pavel Ondračka can reproduce the hang in Total War: Shogun2 on Mesa 11.0.something (from commit c0722be9).

Pavel Popov can reproduce the hang in Total War: Napoleon and Total War: Empire on Mesa 11.0.4 (from commit ec14e6f8).

Pavel Popov can reproduce the hang in TW:N and TW:E on Mesa 10.6 (unknown commit).

Pavel Popov cannot reproduce the hang in TW:N or TW:E on Mesa 10.4 (unknown commit).

Ken cannot reproduce the hang in TW:E on Mesa 10.2 (commit 604ae33).

Ken cannot reproduce the hang in TW:E on Mesa master (unknown commit from 3-December-2015).

Pavel Popov cannot reproduce the hang in TW:N or TW:E on Mesa master (from commit 41e82f4f).

Reverting some patches (604ae33c8 or f5cf74d8, but *not both*) that add algebraic optimizations or disabling nir_opt_algebraic.py eliminates the hangs.
Comment 11 Ian Romanick 2015-12-07 22:53:35 UTC
Assuming that the information in comment #10 is at least mostly correct, I'd like to see several pieces of additional information:

1. Do the hangs occur on Mesa 11.0.6 or the current tip of the 11.0 stable branch?  I suspect that they will, but I want to be thorough.

2. If the hangs occur on Mesa 11.0.4 and do not occur on master, can someone bisect to see when this was fixed?  There may be some backend patch that we want to cherry pick back to 11.0.

3. Can someone attach the GEN assembly of the shaders that trigger the hang?  It sounds like running the Shogun2.trace trace with the environment variable INTEL_DEBUG=vs,gs,fs should do the trick.
Comment 12 pavel.e.popov 2015-12-08 13:59:07 UTC
I've managed to bisect the patch which fixes hangs for all Total Wars cases (TW:E, TW:N, TW:S):
    i965: always run the post-RA scheduler
    http://cgit.freedesktop.org/mesa/mesa/commit/?id=486268bdb03a36faf09d84e0458ff49dd1325c40

Also I applied this patch to Mesa 11.0.6 and made sure that hangs are gone.

Please double check me, I have a very specific environment which can affect results. I think that apitrace for Shogun2 is enough to do this.
Comment 13 pavel.e.popov 2015-12-08 14:06:14 UTC
Created attachment 120414
Shaders output with and without hangs

Used INTEL_DEBUG=vs,gs,fs for Shogun2 trace and obtained 2 logs: with (good one) and without (bad one) patch "i965: always run the post-RA scheduler".
Comment 14 Matt Turner 2016-11-30 01:44:14 UTC
I believe I came across a bug that is the cause of these problems.

I sent a four patch series

      i965/fs: Rename opt_copy_propagate -> opt_copy_propagation.
      i965/fs: Add unit tests for copy propagation pass.
      i965/fs: Reject copy propagation into SEL if not min/max.
      nir: Move fsat outside of fmin/fmax if second arg is 0 to 1.

where the 3rd is the bug fix.

My theory is that enabling NIR caused the code we optimize in the backend to hit this bug.
Comment 15 Matt Turner 2016-12-13 00:00:22 UTC
I've committed those four patches.

commit 7bed52bb5fb4cfd5f91c902a654b3452f921da17
Author: Matt Turner <mattst88@gmail.com>
Date:   Mon Nov 28 15:21:51 2016 -0800

    i965/fs: Reject copy propagation into SEL if not min/max.

and the previous three should fix a codegen bug seen in a number of games, including Shogun 2.

I would love it if we were able to confirm that this was the culprit, but it may not be possible.

Please do not hesitate to reopen if you can reproduce.

