Bug 107552

Summary: new handle table code makes dota2 vulkan crash at start from an empty shader cache
Product: DRI
Reporter: Sylvain BERTRAND <sylvain.bertrand>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
Severity: normal
Priority: medium
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Attachments:
  Description     Flags
  Possible fix    none

Description Sylvain BERTRAND 2018-08-12 22:15:25 UTC
linux amd-staging-drm-next 7b20167b9670150698201b1d9e4bddd75ad60491 (today)
libdrm 879d7c0298d1d4bc52d71d599cc07cafb4645808 (today)
llvm ac6c5d8a36474ca9216ee2628d6a267a72a44d32 (today)
mesa de57926dc909b3fb180ff06a6c5235309fdbf4df (today)
xserver 1fc20b985cc888345bc8c6fce7b43f10ce71fe43 (today)
xf86-video-amdgpu 08c4d42f43f80baa4bbc2ff9d0a422202cdc3538 (today)

AMD GPU: Tahiti XT

Empty the shader cache in $HOME/.cache and start dota2 vulkan: it crashes at a random location. Start it a second time and it works. Empty the shader cache again and it crashes again on the first start.
After reverting libdrm to the version from before the handle table code, everything is fine.
Comment 1 Christian König 2018-08-13 06:36:44 UTC
Well do you have a backtrace (with symbols)?

Otherwise it is really hard to guess what is going wrong here.
Comment 2 Sylvain BERTRAND 2018-08-13 18:52:14 UTC
I did a lot of testing and dumped cores (libsegfault.so and catchsegv do not
work with steam/dota2). The issue is actually random: sometimes it works, but
most of the time it crashes or hangs the vulkan dota2 process, or hangs the
whole system. Using the steam overlay with vulkan dota2 makes the crashes/hangs
happen much more often. The segfault backtraces land in various parts of
vulkan dota2 (only very few of them in the mesa vulkan driver) and do not show
anything in libdrm*. I even managed to crash the new steam chat (it was unable
to restart itself). If I play a (GL accelerated) video at the same time, the
way vulkan dota2 crashes/hangs changes.

The system is stable again if I switch back to the previous libdrm hash table
code.
Comment 3 Bas Nieuwenhuizen 2018-08-14 00:23:57 UTC
Even if you cannot get a backtrace, can you bisect to show the exact commit which causes the issue?
Comment 4 Sylvain BERTRAND 2018-08-14 01:44:02 UTC
Sure, but there are very few commits in between. Basically, it comes down to
the commits related to the handle table:
87fdbfb62fb3de6759d465d07cc13f922084694e stable commit
879d7c0298d1d4bc52d71d599cc07cafb4645808 unstable commit
Comment 5 Sylvain BERTRAND 2018-08-14 14:18:19 UTC
bisected:
--------------------------------------------------------------------------------
commit cbf0bb7f192b814be84dff538fb90dacf65958c7
Author: Christian König <christian.koenig@amd.com>
Date:   Thu Aug 2 10:45:19 2018 +0200

    amdgpu: always add all BOs to handle table

    This way we can always find a BO structure by its handle.

    Signed-off-by: Christian König <christian.koenig@amd.com>
    Reviewed-by: Michel Dänzer <michel.daenzer@amd.com>
    Reviewed-and-Tested-by: Junwei Zhang <Jerry.Zhang@amd.com>
--------------------------------------------------------------------------------
I am currently running libdrm from the commit right before this one, no
problem so far.
Comment 6 Christian König 2018-08-15 11:57:54 UTC
Created attachment 141104
Possible fix

The attached patch should fix the issue, please test.
Comment 7 Sylvain BERTRAND 2018-08-15 13:02:16 UTC
I applied it on top of commit 1e12c16d7697a1223630a507c1032d940794039a.

Stable so far.
Comment 8 Christian König 2018-08-16 06:55:34 UTC
Thanks for the report, sounds like we can close this.
