Created attachment 124784 [details]
gpu lockups part from dmesg
Hi, the GPU tries to reset a few times but hangs in the end.
Radeon HD 7770
mesa latest from git
libdrm latest from git
first bad commit is:
r273467 | arsenm | 2016-06-22 22:15:28 +0200 |
AMDGPU: Fix verifier errors in SILowerControlFlow
The main sin this was committing was using terminator
instructions in the middle of the block, and then
not updating the block successors / predecessors.
Split the blocks up to avoid this and introduce new
pseudo instructions for branches taken with exec masking.
Also use a pseudo instead of emitting s_endpgm and erasing
it in the special case of a non-void return.
Matt, any ideas offhand?
Arek, can you attach the stderr output from running the game with the environment variable
with and without the commit in question?
Created attachment 124796 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes ./AlienIsolation for r273466
Created attachment 124797 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes ./AlienIsolation for r273467
I didn't mention it before, but the intro, loading screen and main menu work. The game hangs right after everything is loaded.
r274275 fixes a problem I noticed while doing more work on this, although I wouldn't expect it to change much.
The only obvious difference I see in the dump diffs, without looking at any particular shader, is that the number of used registers changed. This is probably because the implicit uses of the super registers were previously missing when the AsmPrinter counted them. If a dynamic index was out of bounds, it was more likely to be out of bounds of the allocated VGPRs, in which case the hardware behavior is to return v0. If there are out-of-bounds accesses, they would now read an undefined register. I don't know whether there are any actual out-of-bounds dynamic vector accesses.
There is nothing obviously wrong with the last shader(s) in the bad log - and unfortunately, the logs are not really comparable: the first genuine difference is in TGSI, which means that a different sequence of OpenGL calls happened in the two runs. This makes it basically impossible to figure out the problem.
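For anyone comparing such dumps themselves, finding the first point where two logs diverge can be sketched roughly as below. This is only an illustration in Python, not a tool that was used in this thread; the function name `first_difference` and the sample lines are made up, and real dumps would be read from the two log files instead.

```python
from itertools import zip_longest


def first_difference(lines_a, lines_b):
    """Return (line_number, a, b) for the first differing line, or None.

    Line numbers are 1-based. If one log is shorter, the missing line is
    reported as None on that side.
    """
    for number, (a, b) in enumerate(zip_longest(lines_a, lines_b), start=1):
        if a != b:
            return number, a, b
    return None


# Placeholder content, not taken from the attached logs.
good_log = ["line one", "line two", "line three"]
bad_log = ["line one", "line 2", "line three"]
print(first_difference(good_log, bad_log))
```

If the first reported difference already sits in a TGSI dump rather than in the generated shader code, the two runs issued different OpenGL calls, so later differences cannot be pinned on the compiler alone.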
To make progress on this bug, could you please record an apitrace of the game, and see if you can reproduce the lockups by playing back the trace? If this works, please provide
1. the trace file itself (e.g. upload on Google Drive)
2. before and after logs of playing back the trace like Michel asked for.
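Collecting the before and after logs amounts to running the same replay under each LLVM build with the debug variable set and saving stderr. A minimal Python sketch, assuming the `R600_DEBUG` flags and the `apitrace replay` invocation that appear in the attachment titles of this thread; the helper name `run_with_debug_log` and the log file names are hypothetical.

```python
import os
import subprocess


def run_with_debug_log(command, log_path, extra_env):
    """Run command with extra environment variables set.

    stderr of the child process is written to log_path.
    Returns the exit code of the process.
    """
    env = dict(os.environ)
    env.update(extra_env)
    with open(log_path, "wb") as log:
        result = subprocess.run(command, env=env, stderr=log)
    return result.returncode


# Example call (not executed here). Run it once per LLVM build and
# change the log name each time, e.g. r273466.log and then r273467.log:
#
# run_with_debug_log(
#     ["apitrace", "replay", "AlienIsolation.1.trace"],
#     "r273466.log",
#     {"R600_DEBUG": "fs,vs,gs,ps,cs,tcs,tes"},
# )
```

Keeping the command and flags identical between the two runs means that only the LLVM revision changes, which makes the resulting logs easier to compare.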
Hi guys, replaying the trace causes a GPU lockup as well.
apitrace is here:
Created attachment 124857 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes apitrace replay AlienIsolation.1.trace r273466
Created attachment 124858 [details]
R600_DEBUG=fs,vs,gs,ps,cs,tcs,tes apitrace replay AlienIsolation.1.trace r273467
Hi Arek, thanks for the trace and new logs!
Looking at the logs, the only diff is in branch instructions. Perhaps there is a bug in how kill instructions are lowered now? Since there are several shaders with differences, it's not clear yet. I'm going to try to narrow it down to a single shader using the trace.
The first bug that I noticed in the shaders was in return handling for non-monolithic shader parts. A fix for that bug is here: http://reviews.llvm.org/D21975
Nicolai, thanks for the fix. That did the job. The game now looks even better :)
Really! I'll try reverting LLVM to the old revision and playing it again; maybe it's just my imagination.
*** Bug 96794 has been marked as a duplicate of this bug. ***
Fixed in LLVM r274612 "AMDGPU: Fix return of non-void-returning shaders".
(In reply to Michel Dänzer from comment #1)
> Matt, any ideas offhand?
> Arek, can you attach the stderr output from running the game with the
> environment variable
> with and without the commit in question?
For obtaining the hanging shader, setting GALLIUM_DDEBUG=800 and attaching the created log file is better. The issue would have been pretty obvious from that.