Bug 96731 - [RADEONSI] [LLVM] [bisected] GPU lockups when running Alien: Isolation
Summary: [RADEONSI] [LLVM] [bisected] GPU lockups when running Alien: Isolation
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
URL:
Whiteboard:
Keywords:
Duplicates: 96794
Depends on:
Blocks:
 
Reported: 2016-06-29 19:06 UTC by Arek Ruśniak
Modified: 2016-07-06 10:01 UTC
CC List: 2 users

See Also:
i915 platform:
i915 features:


Attachments
gpu lockups part from dmesg (16.43 KB, text/plain)
2016-06-29 19:06 UTC, Arek Ruśniak
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes ./AlienIsolation for r273466 (5.99 MB, application/octet-stream)
2016-06-30 11:06 UTC, Arek Ruśniak
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes ./AlienIsolation for r273467 (6.00 MB, application/octet-stream)
2016-06-30 11:08 UTC, Arek Ruśniak
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes apitrace replay AlienIsolation.1.trace r273466 (6.02 MB, application/octet-stream)
2016-07-02 08:55 UTC, Arek Ruśniak
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes apitrace replay AlienIsolation.1.trace r273467 (6.02 MB, application/octet-stream)
2016-07-02 08:58 UTC, Arek Ruśniak

Description Arek Ruśniak 2016-06-29 19:06:59 UTC
Created attachment 124784 [details]
gpu lockups part from dmesg

Hi, the GPU tries to reset a few times but hangs in the end.

ArchLinux 64
Radeon HD 7770
mesa latest from git
kernel 4.7rc
libdrm latest from git


first bad commit is:

r273467 | arsenm | 2016-06-22 22:15:28 +0200 |

AMDGPU: Fix verifier errors in SILowerControlFlow

The main sin this was committing was using terminator
instructions in the middle of the block, and then
not updating the block successors / predecessors.
Split the blocks up to avoid this and introduce new
pseudo instructions for branches taken with exec masking.

Also use a pseudo instead of emitting s_endpgm and erasing
it in the special case of a non-void return.
Comment 1 Michel Dänzer 2016-06-30 00:29:04 UTC
Matt, any ideas offhand?

Arek, can you attach the stderr output from running the game with the environment variable

 R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes

with and without the commit in question?
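
A minimal sketch of collecting that output, assuming the shader dumps go to stderr as the request above implies. The helper name run_with_shader_dump and the log file names are hypothetical; only the R600_DEBUG value comes from this comment.

```python
import os
import subprocess


def run_with_shader_dump(cmd, log_path, debug_flags="fs,vs,gs,ps,cs,tcs,tes"):
    """Run cmd with R600_DEBUG set and write its stderr to log_path.

    Hypothetical helper: the name and the log path are illustrative only.
    """
    # Copy the current environment and add the debug variable on top.
    env = dict(os.environ, R600_DEBUG=debug_flags)
    with open(log_path, "wb") as log:
        result = subprocess.run(cmd, env=env, stderr=log)
    return result.returncode


# Example use (run once per LLVM revision, then attach both logs):
#   run_with_shader_dump(["./AlienIsolation"], "r273466.log")
#   run_with_shader_dump(["./AlienIsolation"], "r273467.log")
```

Writing stderr straight to a file keeps the multi-megabyte dump out of the terminal and leaves the game's stdout untouched.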
Comment 2 Arek Ruśniak 2016-06-30 11:06:59 UTC
Created attachment 124796 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes ./AlienIsolation for r273466
Comment 3 Arek Ruśniak 2016-06-30 11:08:13 UTC
Created attachment 124797 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes ./AlienIsolation for r273467
Comment 4 Arek Ruśniak 2016-06-30 11:20:46 UTC
I didn't mention it before, but the intro, loading screen and main menu work. The game hangs right after everything is loaded.
Comment 5 Matt Arsenault 2016-06-30 20:56:56 UTC
r274275 fixes a problem I noticed while doing more work on this, although I wouldn't expect it to change much.
Comment 6 Matt Arsenault 2016-06-30 23:58:57 UTC
The only obvious difference I see in the dump diffs, without looking at any particular shader, is that the number of used registers changed. This is probably because the implicit uses of the super registers were previously missing when the AsmPrinter counted them. If a dynamic index was out of bounds, it was more likely to be out of bounds of the allocated VGPRs, in which case the hardware behavior is to return v0. If there are out-of-bounds accesses, they would now read an undefined register. I don't know if there are any actual out-of-bounds dynamic vector accesses.
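
The reasoning above can be illustrated with a toy model. This is only a sketch of the behavior as described in this comment, not a model of the real hardware: the function read_vgpr, the dictionary of written registers and the UNDEFINED marker are all hypothetical.

```python
UNDEFINED = object()  # stand-in for the contents of a never-written register


def read_vgpr(written, allocated_count, index):
    """Toy model of a dynamically indexed VGPR read.

    written          -- dict mapping register number to the value stored in it
    allocated_count  -- how many VGPRs the shader is reported to use
    index            -- the (possibly out-of-bounds) dynamic index
    """
    if index >= allocated_count:
        # Beyond the allocation: the read falls back to v0.
        return written[0]
    # Inside the allocation: whatever is there, possibly never written.
    return written.get(index, UNDEFINED)


vector = {0: 10, 1: 11, 2: 12, 3: 13}   # a 4-element vector in v0..v3
old = read_vgpr(vector, 4, 5)           # smaller register count: falls back to v0
new = read_vgpr(vector, 8, 5)           # larger register count: reads undefined v5
```

With the smaller register count the stray read of index 5 lands outside the allocation and yields the value in v0; with the larger count the same read stays inside the allocation and picks up a register that was never written.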
Comment 7 Nicolai Hähnle 2016-07-01 13:33:42 UTC
There is nothing obviously wrong with the last shader(s) in the bad log - and unfortunately, the logs are not really comparable: the first genuine difference is in TGSI, which means that a different sequence of OpenGL calls happened in the two runs. This makes it basically impossible to figure out the problem.

To make progress on this bug, could you please record an apitrace of the game, and see if you can reproduce the lockups by playing back the trace? If this works, please provide

1. the trace file itself (e.g. upload on Google Drive)
2. before and after logs of playing back the trace like Michel asked for.
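
A rough sketch of the record-and-replay step, assuming apitrace is installed and on PATH. The helper names and the trace file name are hypothetical, and the run parameter exists only so the sketch can be exercised without launching anything.

```python
import subprocess


def trace_command(program, trace_file):
    """Command line that records the program's OpenGL calls into trace_file."""
    return ["apitrace", "trace", "-o", trace_file, program]


def replay_command(trace_file):
    """Command line that plays a recorded trace back."""
    return ["apitrace", "replay", trace_file]


def record_then_replay(program, trace_file, run=subprocess.run):
    """Record a trace, then replay it and return the replay's exit code."""
    run(trace_command(program, trace_file), check=True)
    return run(replay_command(trace_file)).returncode


# Example use:
#   record_then_replay("./AlienIsolation", "AlienIsolation.trace")
```

If the replay locks up the GPU as well, the same trace can be replayed against each LLVM revision, so the two logs come from an identical sequence of OpenGL calls.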
Comment 8 Arek Ruśniak 2016-07-02 08:50:34 UTC
Hi guys, replaying the trace causes a GPU lockup as well.
The apitrace is here:
https://drive.google.com/open?id=0Bx3qMdwakiQMaTNxd0JsazA4ejA
Comment 9 Arek Ruśniak 2016-07-02 08:55:57 UTC
Created attachment 124857 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes apitrace replay AlienIsolation.1.trace r273466
Comment 10 Arek Ruśniak 2016-07-02 08:58:31 UTC
Created attachment 124858 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes apitrace replay AlienIsolation.1.trace r273467
Comment 11 Nicolai Hähnle 2016-07-02 10:06:34 UTC
Hi Arek, thanks for the trace and new logs!

Looking at the logs, the only diff is in branch instructions. Perhaps there is a bug in how kill instructions are lowered now? Since there are several shaders with differences, it's not clear yet. I'm going to try to narrow it down to a single shader using the trace.
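
A check of this kind can be sketched in a few lines. The sketch assumes each dump line is one instruction with the mnemonic as its first token, and it treats mnemonics starting with s_branch or s_cbranch as branches; the helper names are hypothetical and the sample lines are made up for illustration, not taken from the attached logs.

```python
import difflib


def changed_lines(before, after):
    """Return the lines that appear in only one of the two dumps."""
    return [
        entry[2:]
        for entry in difflib.ndiff(before, after)
        if entry[:2] in ("- ", "+ ")  # skip unchanged lines and "? " hint lines
    ]


def is_branch(line):
    """True if the first token looks like a scalar branch mnemonic."""
    tokens = line.split()
    return bool(tokens) and tokens[0].startswith(("s_branch", "s_cbranch"))


def only_branches_changed(before, after):
    """True if the dumps differ and every differing line is a branch."""
    changed = changed_lines(before, after)
    return bool(changed) and all(is_branch(line) for line in changed)


# Made-up sample input:
good = ["v_mov_b32 v0, v1", "s_cbranch_execz BB0_2", "s_endpgm"]
bad = ["v_mov_b32 v0, v1", "s_cbranch_execz BB0_3", "s_endpgm"]
result = only_branches_changed(good, bad)
```

Running the check per shader rather than over the whole log would help narrow the problem down to a single shader.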
Comment 12 Nicolai Hähnle 2016-07-04 15:07:21 UTC
The first bug that I noticed in the shaders was in return handling for non-monolithic shader parts. The fix for that bug is here: http://reviews.llvm.org/D21975
Comment 13 Arek Ruśniak 2016-07-04 16:22:06 UTC
Nicolai, thanks for the fix. That did the job. The game now looks even better :)
Really! I'll try reverting LLVM to the old revision and playing it again; maybe it's just my imagination.
Comment 14 Michel Dänzer 2016-07-05 00:57:59 UTC
*** Bug 96794 has been marked as a duplicate of this bug. ***
Comment 15 Nicolai Hähnle 2016-07-06 08:58:04 UTC
Fixed in LLVM r274612 "AMDGPU: Fix return of non-void-returning shaders".
Comment 16 Marek Olšák 2016-07-06 10:01:14 UTC
(In reply to Michel Dänzer from comment #1)
> Matt, any ideas offhand?
> 
> Arek, can you attach the stderr output from running the game with the
> environment variable
> 
>  R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes
> 
> with and without the commit in question?

For obtaining the hanging shader, setting GALLIUM_DDEBUG=800 and attaching the created log file is better. The issue would have been pretty obvious from that.

