Bug 98028

Summary: Guns of Icarus Online segfaults on startup since AMDGPU: Partially fix control flow at -O0
Product: Mesa Reporter: Daniel Scharrer <daniel>
Component: Drivers/Gallium/radeonsi Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact: Default DRI bug account <dri-devel>
Severity: normal    
Priority: medium CC: arsenm2, daniel
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
Attachments: Short Guns of Icarus Online startup trace (truncated by segfault)
Backtraces recorded for the crashes
R600_DEBUG=vs,tcs,tes,gs,ps,cs log
Another R600_DEBUG=vs,tcs,tes,gs,ps,cs log
Valgrind log

Description Daniel Scharrer 2016-10-03 12:58:30 UTC
Created attachment 126970
Short Guns of Icarus Online startup trace (truncated by segfault)

Guns of Icarus Online segfaults (or sometimes hangs) on startup with current Mesa and LLVM. I have bisected the segfault to LLVM r282667.

The backtraces for the segfaults vary. Some of the segfaults are inside malloc / free, indicating possible memory corruption.

I have attached an apitrace recorded using a bad LLVM revision. While the game consistently segfaults (or hangs), replaying the trace does not result in a segfault every time.

Here is also a longer trace of the full startup sequence recorded using a good LLVM revision:
 http://constexpr.org/tmp/GoIO-radeonsi.2.trace.xz (82 MiB)

GPU: R9 380X
Kernel: 4.7.5-gentoo
Mesa: git-024c207
LLVM: r283076
Comment 1 Michel Dänzer 2016-10-04 08:25:42 UTC
Please attach a backtrace of a segfault.
Comment 2 Daniel Scharrer 2016-10-04 13:48:31 UTC
Created attachment 126992
Backtraces recorded for the crashes

Here is a list of backtraces I have seen - it's probably not complete.
Comment 3 Nicolai Hähnle 2016-10-04 15:34:11 UTC
I haven't been able to reproduce this with Mesa master and LLVM r283219 so far. Does this happen with clean re-builds?

If this still happens with current LLVM and clean re-builds, please provide logs with R600_DEBUG=vs,tcs,tes,gs,ps,cs.
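
A minimal sketch of how such a log could be captured, assuming the driver writes the shader dumps to stderr. The helper name `capture_shader_log` and the file and binary names in the usage line are hypothetical and not taken from this report.

```shell
# Run a command with the requested R600_DEBUG flags and redirect its
# stderr (where the dumps are assumed to go) into a log file.
# Usage: capture_shader_log <log-file> <command> [args...]
capture_shader_log() {
    shader_log="$1"
    shift
    # The assignment applies only to the environment of the launched command.
    R600_DEBUG=vs,tcs,tes,gs,ps,cs "$@" 2> "$shader_log"
}

# Example (hypothetical binary name):
#   capture_shader_log r600-debug.log ./game-binary
```

The resulting file can then be attached to the bug as-is.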

The wide range of different backtraces suggests that it might be random memory corruption, so running under Valgrind may also be worth a shot.
Comment 4 Daniel Scharrer 2016-10-04 18:22:53 UTC
Created attachment 127001
R600_DEBUG=vs,tcs,tes,gs,ps,cs log

(In reply to Nicolai Hähnle from comment #3)
> I haven't been able to reproduce this with Mesa master and LLVM r283219 so
> far. Does this happen with clean re-builds?

Yes, all LLVM and Mesa builds were done through the package manager, starting with an empty build directory. And I don't use ccache.

I just re-checked with an updated LLVM & Mesa and the game still crashes:
Mesa: git-0e85ff3
LLVM: r283225

I also verified that it still starts properly with amdgpu-pro (running on top of the upstream 4.7.5 amdgpu module).

> If this still happens with current LLVM and clean re-builds, please provide
> logs with R600_DEBUG=vs,tcs,tes,gs,ps,cs.
> 
> The wide range of different backtraces suggests that it might be random
> memory corruption, so running under Valgrind may also be worth a shot.

It does look like it. I'll get a valgrind memcheck log, but will first need to recompile a couple of libraries because valgrind still doesn't support all the instructions of my CPU :/
Comment 5 Daniel Scharrer 2016-10-04 18:24:00 UTC
Created attachment 127002
Another R600_DEBUG=vs,tcs,tes,gs,ps,cs log

Looks like the crashes don't always happen for the same shader.
Comment 6 Daniel Scharrer 2016-10-04 22:51:12 UTC
Created attachment 127008
Valgrind log

I managed to get a Valgrind log; the backtrace of the first invalid read seems consistent.

Here are the options I used; let me know if you want me to try any others:

 valgrind --tool=memcheck --error-limit=no --log-file=valgrind-%p.log -v --trace-children=yes --track-origins=yes  --read-var-info=yes --redzone-size=1024 --
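
For anyone repeating this, the same options can be wrapped in a small helper. This is only a sketch: the function name `run_memcheck`, the `DRY_RUN` switch, and the binary name in the usage lines are hypothetical; only the Valgrind options are the ones listed above.

```shell
# Run a command under memcheck with the options listed above.
# With DRY_RUN set to a non-empty value, print the command instead of
# executing it.
# Usage: run_memcheck <command> [args...]
run_memcheck() {
    ${DRY_RUN:+echo} valgrind --tool=memcheck --error-limit=no \
        --log-file=valgrind-%p.log -v --trace-children=yes \
        --track-origins=yes --read-var-info=yes --redzone-size=1024 \
        -- "$@"
}

# Example (hypothetical binary name):
#   DRY_RUN=1 run_memcheck ./game-binary   # only prints the command
#   run_memcheck ./game-binary             # writes valgrind-<pid>.log files
```

Because of `--trace-children=yes` and the `%p` in the log file name, each traced process writes its own log file.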

I also noticed that the game's engine (Unity) overrides operator new and friends, maybe that's involved somehow.
Comment 7 Nicolai Hähnle 2016-10-05 13:20:19 UTC
Thanks for the additional info. Running llc on those shaders under Valgrind doesn't show anything either, but this may be a limitation of Valgrind in connection with LLVM's internal allocator.

That this is exposed by the game's operator overrides is curious. If the bisection result is solid, we can't put the blame on those overrides though.
Comment 8 Nicolai Hähnle 2016-10-06 08:44:03 UTC
Careful inspection of the commit you bisected this to has led me to a smoking gun. Could you please check whether the patch at https://reviews.llvm.org/D25306 fixes this for you?
Comment 9 Daniel Scharrer 2016-10-06 14:22:18 UTC
Your patch from D25306 fixes the crash for me. Thanks for looking into this.
Comment 10 Nicolai Hähnle 2016-10-07 08:50:03 UTC
Fixed in LLVM r283528.
