Created attachment 136731 [details]
logcat with chrome app crash
while testing Android 8.1 build with latest mesa 17.4.0devel and libdrm 2.4.89
SIGSEGV MAPERR due to NULL pointer dereference happens with Chrome (and Firefox browser) apps, here is logcat extract for chrome
--------- beginning of crash
... F libc : Fatal signal 11 (SIGSEGV), code 1, fault addr 0x4 in tid 4851 (RenderThread), pid 4503 (.android.chrome)
... I crash_dump32: obtaining output fd from tombstoned, type: kDebuggerdTombstone
... I /system/bin/tombstoned: received crash request for pid 4503
... I crash_dump32: performing dump of process 4503 (target tid = 4851)
... F DEBUG : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
... F DEBUG : Build fingerprint: 'Android-x86/android_x86_64/x86_64:8.1.0/OPM1.171019.011/utente12162000:eng/test-keys'
... F DEBUG : Revision: '0'
... F DEBUG : ABI: 'x86'
... F DEBUG : pid: 4503, tid: 4851, name: RenderThread >>> com.android.chrome <<<
... F DEBUG : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4
... F DEBUG : Cause: null pointer dereference
... F DEBUG : eax 00000000 ebx cca69cf4 ecx 00000009 edx bfbff004
... F DEBUG : esi 00000000 edi 00000010
... F DEBUG : xcs 00000023 xds 0000002b xes 0000002b xfs 0000006b xss 0000002b
... F DEBUG : eip cc78d089 ebp bfbfefb8 esp bfbfefb8 flags 00010286
... F DEBUG :
... F DEBUG : backtrace:
... F DEBUG : #00 pc 00651089 /system/vendor/lib/dri/i965_dri.so (build_id_length+9)
... F DEBUG : #01 pc 00041ae2 /system/vendor/lib/dri/i965_dri.so (brw_program_binary_init+130)
... F DEBUG : #02 pc 0002dd50 /system/vendor/lib/dri/i965_dri.so (brwCreateContext+864)
... F DEBUG : #03 pc 003d7777 /system/vendor/lib/dri/i965_dri.so (driCreateContextAttribs+663)
... F DEBUG : #04 pc 000112a1 /android/system/vendor/lib/egl/libGLES_mesa.so
... F DEBUG : #05 pc 00004096 /android/system/vendor/lib/egl/libGLES_mesa.so
... F DEBUG : #06 pc 0000ef37 /android/system/lib/libEGL.so
Analysing the backtrace with addr2line the NULL pointer is retured at
utente@utente-Terabyte:~/oreo-x86_kernel_test$ addr2line -Cfe out/target/product/x86_64/symbols/system/vendor/lib/dri/i965_dri.so
external/mesa/src/util/build_id.c:117 <= here it is
utente@utente-Terabyte:~/oreo-x86_kernel_test$ addr2line -Cfe out/target/product/x86_64/symbols/system/vendor/lib/egl/libGLES_mesa.so
utente@utente-Terabyte:~/oreo-x86_kernel_test$ addr2line -Cfe out/target/product/x86_64/symbols/system/lib/libEGL.so
Recent changes in src/util/build_id.c may have removed some Android required code, or i965 needs to take into account the NULL case.
Thanks for feedback
I haven't yet found a proper fix bug here's a workaround you can consider to use:
The problem is with the address comparison check in build_id callback (build_id_find_nhdr_callback). This comparison fails on Android when running 32bit. One alternative would be to always add '0x8000' to dlpi_addr during comparison on 32bit but that seems a bit weird.
Regarding the length, I've been also considering if we are missing required build flags somewhere. We are setting it currently in these places:
src/intel/Android.vulkan.mk:LOCAL_LDFLAGS += -Wl,--build-id=sha1
Android.mk:MESA_DRI_LDFLAGS := -Wl,--build-id=sha1
It *should* be enough I think .. but maybe not.
Thanks a lot for sharing the workaround
It seams a regression introduced by 5c98d3825ccbed9054a1bb2de607116b2b31d48b "util: Query build-id by symbol address, not library name".
Is Chad Versace already having a look?
In the former coding of build_id_find_nhdr_callback() there was a comment:
- /* The first object visited by callback is the main program.
- * Android's libc returns a NULL pointer for the first executable.
- if (info->dlpi_name == NULL)
- return 0;
and NULL was checked; does last return 0 mean that nothing was done for Android, if Android libc returns systematically NULL?
In any case, I think code should also be robust to unconformant libraries and should not crash.
Just a question for my knowledge, where does the add '0x8000' to dlpi_addr during comparison on 32bit comes from, is it due to some "Android thing"?
(In reply to Mauro Rossi from comment #3)
> Hi Tapani,
> Thanks a lot for sharing the workaround
> It seams a regression introduced by 5c98d3825ccbed9054a1bb2de607116b2b31d48b
> "util: Query build-id by symbol address, not library name".
> Is Chad Versace already having a look?
As far as I know, no.
> In the former coding of build_id_find_nhdr_callback() there was a comment:
> - /* The first object visited by callback is the main program.
> - * Android's libc returns a NULL pointer for the first executable.
> - */
> - if (info->dlpi_name == NULL)
> - return 0;
> and NULL was checked; does last return 0 mean that nothing was done for
> Android, if Android libc returns systematically NULL?
This callback gets called for binary itself + libraries. This NULL check is not needed anymore since address check is better guard (if it worked ..). It works fine on desktop, both 32bit and 64bit. On Android it also works fine on 64bit.
> In any case, I think code should also be robust to unconformant libraries
> and should not crash.
Agreed, but if it's a linker bug I think then we want to get it fixed in bionic as well. Let's try to avoid duct-taping if possible.
> Just a question for my knowledge, where does the add '0x8000' to dlpi_addr
> during comparison on 32bit comes from, is it due to some "Android thing"?
That seems to be the amount of offset there is so that comparison will work. I haven't figured out why this is.
This commit should fix it properly: https://github.com/me176c-dev/android_external_mesa/commit/07b637bcef660b3bf12d6854153692026ef3c698
Can you test this patch, Mauro?
The main problem is that the build-id code doesn't handle the (unusual) case that there is an offset between the base address and the LOAD segment (as indicated by the virtual address in the ELF header). On Android 32-bit, the offset comes from the "relocation_packer" in the Android build system. It seems to re-pack the binary and adds the offset for whatever reason.
The ELF header tells us the offset so we just need to look through the header and add the offset to the base address (dlpi_addr) before comparing it to dli_fbase.
with the patch mentioned in comment #5
the problem is resolved, Chrome app and Firefox app are not crashing anymore.
Please proceed in having the fix merged in mesa 18.0 cycle
Pushed to master. It should find its way to the stable releases in due time.
Author: Stephan Gerhold <email@example.com>
Date: Wed Jan 24 15:13:24 2018 +0100
util/build-id: Fix address comparison for binaries with LOAD vaddr > 0