Bug 96731

Summary: [RADEONSI] [LLVM] [bisected] GPU lockups when running Alien: Isolation
Product: Mesa
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: Other
OS: All
Status: RESOLVED FIXED
Severity: normal
Priority: medium
Reporter: Arek Ruśniak <arek.rusi>
Assignee: Default DRI bug account <dri-devel>
QA Contact: Default DRI bug account <dri-devel>
CC: arsenm2, t.hirsch
i915 platform: i915 features:
Attachments: gpu lockups part from dmesg
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes ./AlienIsolation for r273466
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes ./AlienIsolation for r273467
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes apitrace replay AlienIsolation.1.trace r273466
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes apitrace replay AlienIsolation.1.trace r273467

Description Arek Ruśniak 2016-06-29 19:06:59 UTC
Created attachment 124784 [details]
gpu lockups part from dmesg

Hi, the GPU tries to reset a few times but hangs in the end.

ArchLinux 64
Radeon HD 7770
mesa latest from git
kernel 4.7rc
libdrm latest from git

first bad commit is:

r273467 | arsenm | 2016-06-22 22:15:28 +0200 |

AMDGPU: Fix verifier errors in SILowerControlFlow

The main sin this was committing was using terminator
instructions in the middle of the block, and then
not updating the block successors / predecessors.
Split the blocks up to avoid this and introduce new
pseudo instructions for branches taken with exec masking.

Also use a pseudo instead of emitting s_endpgm and erasing
it in the special case of a non-void return.
Comment 1 Michel Dänzer 2016-06-30 00:29:04 UTC
Matt, any ideas offhand?

Arek, can you attach the stderr output from running the game with the environment variable

 R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes

with and without the commit in question?
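
A minimal sketch of how such a log could be captured, assuming Python 3 is available. The helper name capture_stderr and the log file names are hypothetical; only the R600_DEBUG value comes from this report.

```python
import os
import subprocess


def capture_stderr(cmd, log_path, extra_env):
    """Run cmd with extra_env added to the environment and save its stderr."""
    env = dict(os.environ)
    env.update(extra_env)
    # stderr goes to the log file; stdout is left untouched.
    with open(log_path, "wb") as log:
        result = subprocess.run(cmd, env=env, stderr=log)
    return result.returncode


# Hypothetical usage, once per LLVM revision under test:
# shader_dump = {"R600_DEBUG": "fs,vs,gs,ps,cs,tcs,tes"}
# capture_stderr(["./AlienIsolation"], "r273466.log", shader_dump)
```

Running it once with each LLVM build gives two logs that can be attached separately.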
Comment 2 Arek Ruśniak 2016-06-30 11:06:59 UTC
Created attachment 124796 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes ./AlienIsolation for r273466
Comment 3 Arek Ruśniak 2016-06-30 11:08:13 UTC
Created attachment 124797 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes ./AlienIsolation for r273467
Comment 4 Arek Ruśniak 2016-06-30 11:20:46 UTC
I didn't mention it before, but the intro, loading screen and main menu work. The game hangs right after everything is loaded.
Comment 5 Matt Arsenault 2016-06-30 20:56:56 UTC
r274275 fixes a problem I noticed while doing more work on this, although I wouldn't expect it to change much
Comment 6 Matt Arsenault 2016-06-30 23:58:57 UTC
The only obvious difference I see in the dump diffs, without looking at any particular shader, is that the number of used registers changed. This is probably because the implicit uses of the super registers were previously missing when the AsmPrinter counted them. Previously, if a dynamic index was out of bounds, it was more likely to be out of bounds of the allocated VGPRs, in which case the hardware behavior is to return v0. If there are out-of-bounds accesses, they would now read an undefined register. I don't know if there are any actual out-of-bounds dynamic vector accesses.
Comment 7 Nicolai Hähnle 2016-07-01 13:33:42 UTC
There is nothing obviously wrong with the last shader(s) in the bad log - and unfortunately, the logs are not really comparable: the first genuine difference is in TGSI, which means that a different sequence of OpenGL calls happened in the two runs. This makes it basically impossible to figure out the problem.

To make progress on this bug, could you please record an apitrace of the game, and see if you can reproduce the lockups by playing back the trace? If this works, please provide

1. the trace file itself (e.g. upload on Google Drive)
2. before and after logs of playing back the trace like Michel asked for.
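
The comparison described above, finding the first line where two dumps diverge, can be sketched as follows. This is an illustration only, not the tooling used for this bug; the function names are hypothetical.

```python
from itertools import zip_longest


def first_difference(lines_a, lines_b):
    """Return (line_number, a, b) for the first differing line, or None."""
    pairs = zip_longest(lines_a, lines_b)
    for number, (a, b) in enumerate(pairs, start=1):
        if a != b:
            # A missing line in the shorter input shows up as None.
            return number, a, b
    return None


def first_log_difference(path_a, path_b):
    """Apply first_difference to two log files on disk."""
    with open(path_a, errors="replace") as fa, open(path_b, errors="replace") as fb:
        return first_difference(fa, fb)
```

If the first reported difference is already in the TGSI text rather than in the generated ISA, the two runs issued different API calls and the logs cannot be compared shader by shader.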
Comment 8 Arek Ruśniak 2016-07-02 08:50:34 UTC
Hi guys, replaying the trace causes a GPU lockup as well.
apitrace is here:
Comment 9 Arek Ruśniak 2016-07-02 08:55:57 UTC
Created attachment 124857 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes apitrace replay AlienIsolation.1.trace r273466
Comment 10 Arek Ruśniak 2016-07-02 08:58:31 UTC
Created attachment 124858 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes apitrace replay AlienIsolation.1.trace r273467
Comment 11 Nicolai Hähnle 2016-07-02 10:06:34 UTC
Hi Arek, thanks for the trace and new logs!

Looking at the logs, the only diff is in branch instructions. Perhaps there is a bug in how kill instructions are lowered now? Since there are several shaders with differences, it's not clear yet. I'm going to try to narrow it down to a single shader using the trace.
Comment 12 Nicolai Hähnle 2016-07-04 15:07:21 UTC
The first bug that I noticed in the shaders was in return handling for non-monolithic shader parts. The fix for that bug is here: http://reviews.llvm.org/D21975
Comment 13 Arek Ruśniak 2016-07-04 16:22:06 UTC
Nicolai, thanks for the fix. That did the job. The game now looks even better :)
Really! I'll try reverting LLVM to the old revision and playing it again; maybe it's just my imagination.
Comment 14 Michel Dänzer 2016-07-05 00:57:59 UTC
*** Bug 96794 has been marked as a duplicate of this bug. ***
Comment 15 Nicolai Hähnle 2016-07-06 08:58:04 UTC
Fixed in LLVM r274612 "AMDGPU: Fix return of non-void-returning shaders".
Comment 16 Marek Olšák 2016-07-06 10:01:14 UTC
(In reply to Michel Dänzer from comment #1)
> Matt, any ideas offhand?
> Arek, can you attach the stderr output from running the game with the
> environment variable
>  R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes
> with and without the commit in question?

For obtaining the hanging shader, setting GALLIUM_DDEBUG=800 and attaching the created log file is better. The issue would have been pretty obvious from that.
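
A small sketch for locating the log file to attach after a run with GALLIUM_DDEBUG=800. It assumes the dumps land in ~/ddebug_dumps, which should be verified against your Mesa build; the helper name newest_files is hypothetical.

```python
import os


def newest_files(directory, count=3):
    """Return paths of the most recently modified regular files."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    # Newest first, by modification time.
    files.sort(key=os.path.getmtime, reverse=True)
    return files[:count]


# Assumed dump location; adjust if your build writes elsewhere.
# dump_dir = os.path.expanduser("~/ddebug_dumps")
# print(newest_files(dump_dir))
```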
