Bug 107351

Summary: Android 8.1: radv segfault with 3Dmark vulkan benchmarks
Product: Mesa Reporter: Mauro Rossi <issor.oruam>
Component: Drivers/Vulkan/radeonAssignee: mesa-dev
Status: RESOLVED FIXED QA Contact: mesa-dev
Severity: major    
Priority: medium    
Version: git   
Hardware: x86-64 (AMD64)   
OS: other   
Whiteboard:
i915 platform: i915 features:
Attachments: logcat of api_overhead test
logcat of slingshot extreeme vulkan test
extended logcat to see what happens before and after
Hack test case that avoids the segfault

Description Mauro Rossi 2018-07-23 19:45:45 UTC
Created attachment 140796 [details]
logcat of api_overhead test

Hi,

while performing some test on vulkan radv HAL on Android 8.1 (oreo-x86)
build a segfault was noticed and traced with 3Dmark vulkan benchmark.

The build is experimental and based on hwcomposer.drm + gralloc.gbm
and tested on HD7790 gpu (Bonaire)

The same 3Dmark vulkan benchmarks work without segfault on Intel anv running on Skylake GT2, so it is better to investigate why radv crashes.

Inspection of crash with addr2line -Cfe returns last line of code:
src/amd/vulkan/radv_image.c: 1289

Mauro Rossi
android-x86 team


--------- beginning of crash
07-22 21:31:15.938  4949  4995 F libc    : Fatal signal 11 (SIGSEGV), code 1, fault addr 0x1c in tid 4995 (Thread-9), pid 4949 (cation:workload)
...
07-22 21:31:15.956  5007  5007 F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
07-22 21:31:15.956  5007  5007 F DEBUG   : Build fingerprint: 'Android-x86/android_x86_64/x86_64:8.1.0/OPM6.171019.030.E1/maur07141051:userdebug/test-keys'
07-22 21:31:15.956  5007  5007 F DEBUG   : Revision: '0'
07-22 21:31:15.956  5007  5007 F DEBUG   : ABI: 'x86'
07-22 21:31:15.956  5007  5007 F DEBUG   : pid: 4949, tid: 4995, name: Thread-9  >>> com.futuremark.dmandroid.application:workload <<<
07-22 21:31:15.956  5007  5007 F DEBUG   : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x1c
07-22 21:31:15.956  5007  5007 F DEBUG   : Cause: null pointer dereference
07-22 21:31:15.956  5007  5007 F DEBUG   :     eax 00000000  ebx c65b690c  ecx c2a6d0ac  edx 00000097
07-22 21:31:15.956  5007  5007 F DEBUG   :     esi c2a6d0ac  edi c1fbbb00
07-22 21:31:15.956  5007  5007 F DEBUG   :     xcs 00000023  xds 0000002b  xes 0000002b  xfs 0000006b  xss 0000002b
07-22 21:31:15.956  5007  5007 F DEBUG   :     eip c64e0d47  ebp c2b7eb38  esp c2b7eaf0  flags 00010286
07-22 21:31:16.071  5007  5007 F DEBUG   : 
07-22 21:31:16.071  5007  5007 F DEBUG   : backtrace:
07-22 21:31:16.071  5007  5007 F DEBUG   :     #00 pc 000f1d47  /system/vendor/lib/hw/vulkan.radv.so (radv_image_create+199)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #01 pc 000f2f26  /system/vendor/lib/hw/vulkan.radv.so (radv_CreateImage+166)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #02 pc 0000b59a  /android/system/lib/libvulkan.so
Comment 1 Mauro Rossi 2018-07-23 19:49:01 UTC
Created attachment 140797 [details]
logcat of slingshot extreeme vulkan test
Comment 2 Bas Nieuwenhuizen 2018-07-23 20:27:46 UTC
Do you have a hash of the upstream git version you based this on? (just so I can see how the lines correlate in radv_image.c)
Comment 3 Mauro Rossi 2018-07-23 21:11:40 UTC
Hi Bas, 
the problem was obeserved in the last month, but the relevant parts radv_image.c and radv_android.c did not change,
so anything between 3da693b and HEAD

the addr2line output is not always reliable in terms of line of code,
looking at the logcat the call causing the segfault is radv_image_create() at line 1261, the symbols resolved by crash dump should be reliable,
based on previous experiences.

I had a look at differences between radv and anv and I saw two of them:

1. wsi extension implementation is different, so I tried to revert 
   a50f93e (radv/image: Implement the wsi "extension")
   but the problem was not solved

2. radv_image_create() passes .scanout parameter and anv_image_create()
   does not. I have not tried yet to remove .scanout in radv_image_create(), 
   but is it worth an attempt?

Mauro
Comment 4 Bas Nieuwenhuizen 2018-07-23 21:29:42 UTC
No, on ANDROID, the call to radv_image_from_gralloc should happen for wsi images and wsi_info should be always NULL.

Is ANDROID properly defined?
Comment 5 Mauro Rossi 2018-07-23 21:32:33 UTC
>Is ANDROID properly defined?

Yes, because it is defined by the build system itself, 
without need of passing the LOCAL_CFLAGS := -DANDROID

Mauro
Comment 6 Mauro Rossi 2018-07-23 21:50:50 UTC
Created attachment 140802 [details]
extended logcat to see what happens before and after

extended logcat to see what happens before and after
Comment 7 Bas Nieuwenhuizen 2018-07-23 22:25:10 UTC
Looking at the full backtrace it looks like plain texture creation, nothing Android specific so I'll need to debug the specific app sometime.

07-22 21:31:16.071  5007  5007 F DEBUG   : backtrace:
07-22 21:31:16.071  5007  5007 F DEBUG   :     #00 pc 000f1d47  /system/vendor/lib/hw/vulkan.radv.so (radv_image_create+199)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #01 pc 000f2f26  /system/vendor/lib/hw/vulkan.radv.so (radv_CreateImage+166)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #02 pc 0000b59a  /android/system/lib/libvulkan.so
07-22 21:31:16.071  5007  5007 F DEBUG   :     #03 pc 001488fe  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (eva::graphics::vulkan::native_memory_utility::create_texture(VkImageCreateInfo const&)+78)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #04 pc 0013ffe9  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (eva::graphics::vulkan::vulkan_device::create_texture(eva::graphics::texture_desc const&, eva::graphics::subresource_data const*)+233)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #05 pc 001458cf  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so
07-22 21:31:16.071  5007  5007 F DEBUG   :     #06 pc 0013e474  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (eva::graphics::vulkan::create_dds_texture_from_memory(std::function<eva::graphics::native_texture (eva::graphics::texture_desc const&, eva::graphics::subresource_data const*)>, eva::graphics::texture_desc*, unsigned char const*, DXGI_FORMAT, unsigned int, unsigned int, unsigned int)+2484)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #07 pc 00145b8c  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (eva::graphics::vulkan::vulkan_resource_factory::load_texture(boost::filesystem::path const&, DXGI_FORMAT, int)+316)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #08 pc 001ad794  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (eva::graphics::assets::texture_loader::load(std::string const&) const+1412)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #09 pc 00153e69  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (eva::ui::loading_screen_contents::loading_screen_contents(eva::ui::loading_screen_creation_parameters const&, eva::globals&)+649)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #10 pc 0011f1a1  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (mobile_player::player::create_loading_screen()+161)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #11 pc 00125563  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (_ZN13mobile_player6playerC1ERN3eva7globalsESsNS1_4math6vectorIiLj2EEESsRNS1_2ui13invalidatableERNS7_12message_pumpEP13ANativeWindowiNS1_8graphics8platformE+6083)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #12 pc 00118e37  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (android_native::native_application::create_player(std::string, bool, ANativeWindow*)+375)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #13 pc 0011cd0e  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (activity::handle_cmd(int)+206)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #14 pc 0011ce15  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (activity::handle_cmd(android_app*, int)+37)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #15 pc 0025e46c  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so
07-22 21:31:16.071  5007  5007 F DEBUG   :     #16 pc 0011cb9e  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (activity::run()+126)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #17 pc 0011cf02  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so (android_main+98)
07-22 21:31:16.071  5007  5007 F DEBUG   :     #18 pc 0025ddd4  /data/app/com.futuremark.dmandroid.application--TcMIPU5ttg_gGGrKxrVOQ==/lib/x86/lib3dmark_farandole.so
Comment 8 Mauro Rossi 2018-07-23 22:46:30 UTC
In the code between ANDROID braces, is it correct that code falls through in case of !gralloc_info or should it bail out?

Just a question as I may be very wrong and confused


#ifdef ANDROID
   const VkNativeBufferANDROID *gralloc_info =
      vk_find_struct_const(pCreateInfo->pNext, NATIVE_BUFFER_ANDROID);

if (gralloc_info)    /*  this line 1252  */
   return radv_image_from_gralloc(device, pCreateInfo, gralloc_info,
		                              pAllocator, pImage);
#endif
Comment 9 Bas Nieuwenhuizen 2018-07-23 23:36:37 UTC
Yes, images not explicitly using the path for Android cross-api interop should just follow the normal path.
Comment 10 Bas Nieuwenhuizen 2018-07-23 23:38:30 UTC
If you're in a position to get more useful info, it would be great if you could dump the *pCreateInfo arg of the crashing radv_CreateImage.
Comment 11 Mauro Rossi 2018-07-26 23:16:44 UTC
Hi Bas, 
thanks for your instructions

I've tracked the line crashing in the radv_image.c file in master branch

utente@utente-MS-7576:~/oreo-x86_hwc1$ addr2line -Cfe out/target/product/x86_64/symbols/system/vendor/lib/hw/vulkan.radv.so 
000f3407
vk_format_get_nr_components
external/mesa/src/amd/vulkan/vk_format.h:535
000f45e6
radv_CreateImage
external/mesa/src/amd/vulkan/radv_image.c:1261
^C

I don't trust too much addr2lib when the most recent code line is an header,
so I added ALOGD lines with LOG_TAG RADV 
in order to have the incremental inspection of 
radv_image_from_gralloc() arguments to expose the
argument affected by null pointer, here is what I did:


From f2db3b5a70aaf66e9bf60c42da99560e6e5b6c13 Mon Sep 17 00:00:00 2001
From: Mauro Rossi <issor.oruam@gmail.com>
Date: Thu, 26 Jul 2018 23:48:54 +0200
Subject: [PATCH] DO NOT MERGE: add incremental dump of
 radv_image_from_gralloc() arguments

Looking at the line crashing we will identity the argument with null pointer
---
 src/amd/vulkan/radv_image.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/amd/vulkan/radv_image.c b/src/amd/vulkan/radv_image.c
index 826f898d28..b1141c71e2 100644
--- a/src/amd/vulkan/radv_image.c
+++ b/src/amd/vulkan/radv_image.c
@@ -25,6 +25,9 @@
  * IN THE SOFTWARE.
  */
 
+#define LOG_TAG "RADV"
+
+#include <cutils/log.h>
 #include "radv_debug.h"
 #include "radv_private.h"
 #include "vk_format.h"
@@ -1248,7 +1251,11 @@ radv_CreateImage(VkDevice device,
 #ifdef ANDROID
 	const VkNativeBufferANDROID *gralloc_info =
 		vk_find_struct_const(pCreateInfo->pNext, NATIVE_BUFFER_ANDROID);
-
+	ALOGD("radv_image_from_gralloc() device=0x%x", device);
+	ALOGD("radv_image_from_gralloc() *pCreateInfo=0x%x", *pCreateInfo);
+	ALOGD("radv_image_from_gralloc() gralloc_info=0x%x", gralloc_info);
+	ALOGD("radv_image_from_gralloc() *pAllocator=0x%x", *pAllocator);
+	ALOGD("radv_image_from_gralloc() *pImage=0x%x", *pImage);
 	if (gralloc_info)
 		return radv_image_from_gralloc(device, pCreateInfo, gralloc_info,
 		                              pAllocator, pImage);
-- 
2.17.1


The output in the logcat is:

... D RADV    : radv_image_from_gralloc() device=
... D RADV    : radv_image_from_gralloc() *pCreateInfo=0xe
... D RADV    : radv_image_from_gralloc() gralloc_info=0xc502a0e8
... F libc    : Fatal signal 11 (SIGSEGV), code 1, fault addr 0x10 in tid 5548 (Thread-9), pid 5501 (cation:workload)

The crash happened with pAllocator, so this argument is the null pointer.
Mauro
Comment 12 Bas Nieuwenhuizen 2018-07-27 00:03:02 UTC
Note that *pCreateInfo and *gralloc_info are structs, not integers so printing them this way is not very useful, since you are missing most of the data (also gralloc_info is a pointer, printing that is not very useful either).

pAllocator is allowed to be NULL and it likely does not really matter for this issue.
Comment 13 Mauro Rossi 2018-07-27 11:06:44 UTC
Hi,

but the sequence of ALOGD customized LOG_TAG RADV log does tell that the line invoking pAllocator argument was the first not being executed, and thus is involved in the segfault,

could you please give an advice on how to modify the ALOGD arguments in a way it may be more useful?
Thanks
Mauro
Comment 14 Mauro Rossi 2018-07-30 01:21:29 UTC
Created attachment 140879 [details] [review]
Hack test case that avoids the segfault

Hi Bas,

I (truly) don't know how this was possible, because it is all based on trial and
error, but the changes in the attachment allow to avoid the segfault and to proceed in both 3DMARK Slingshot Extreme and 3DMARK API Overhead,
the benchmark now proceed without segfaults.

The changes are inspired from anv code with some changes as explained in the commit message.

Probably the problem is elsewhere, I'd like to understand why these changes work.

Mauro
Comment 15 Mauro Rossi 2018-09-11 06:31:37 UTC
Hi Bas,
the problem is not happening anymore,
in both mesa 18.2 and 18.3 branches.
The bug report can be closed.
Mauro
Comment 16 Samuel Pitoiset 2018-09-14 12:48:02 UTC
Thanks for letting us know. Closing!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.