Bug 101925 - playstore/webview crash
Summary: playstore/webview crash
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965 (show other bugs)
Version: git
Hardware: Other All
: medium normal
Assignee: Jason Ekstrand
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 101911
  Show dependency treegraph
 
Reported: 2017-07-26 11:50 UTC by Tapani Pälli
Modified: 2017-08-08 08:13 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
runtime workaround (1000 bytes, patch)
2017-07-27 12:10 UTC, Tapani Pälli
Details | Splinter Review
workaround that works! (1.82 KB, patch)
2017-08-01 10:31 UTC, Tapani Pälli
Details | Splinter Review
hopeful fix (1.29 KB, patch)
2017-08-03 07:44 UTC, Tapani Pälli
Details | Splinter Review

Description Tapani Pälli 2017-07-26 11:50:27 UTC
Crash happens when trying to login to google play store on android *OR* when trying to launch one of the 3DMark benchmarks (uses chrome webview as well)

What happens is that aux_surface is NULL in 'get_fast_clear_rect'.

backtrace:
--- 8< ---------------------
07-26 11:29:45.832  3126  3126 F DEBUG   : pid: 2951, tid: 3071, name: Chrome_InProcGp  >>> com.google.android.gms.ui <<<
07-26 11:29:45.832  3126  3126 F DEBUG   : signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x10
07-26 11:29:45.832  3126  3126 F DEBUG   :     rax 0000000100000001  rbx 0000000000000100  rcx 0000000000000000  rdx 000077230a5036c0
07-26 11:29:45.832  3126  3126 F DEBUG   :     rsi 0000000000000000  rdi 000077230a5030c0
07-26 11:29:45.832  3126  3126 F DEBUG   :     r8  0000000000000620  r9  0000000000000001  r10 0000000000000000  r11 0000000000000780
07-26 11:29:45.832  3126  3126 F DEBUG   :     r12 000077230a503808  r13 000077230a5037e0  r14 000077230a503808  r15 0000000000000000
07-26 11:29:45.832  3126  3126 F DEBUG   :     cs  0000000000000033  ss  000000000000002b
07-26 11:29:45.832  3126  3126 F DEBUG   :     rip 000077230df54c58  rbp 0000000000000780  rsp 000077230a5030a0  eflags 0000000000010246
07-26 11:29:45.840  3126  3126 F DEBUG   : 
07-26 11:29:45.840  3126  3126 F DEBUG   : backtrace:
07-26 11:29:45.840  3126  3126 F DEBUG   :     #00 pc 00000000004a2c58  /system/lib64/dri/i965_dri.so (blorp_fast_clear+168)
07-26 11:29:45.840  3126  3126 F DEBUG   :     #01 pc 0000000000060f78  /system/lib64/dri/i965_dri.so (brw_blorp_clear_color+1960)
07-26 11:29:45.840  3126  3126 F DEBUG   :     #02 pc 0000000000063e20  /system/lib64/dri/i965_dri.so (brw_clear+752)
07-26 11:29:45.840  3126  3126 F DEBUG   :     #03 pc 0000000000c7a1bf  /system/app/WebViewGoogle/WebViewGoogle.apk (offset 0xa6f000)
--- 8< ---------------------

bisected to this commit:
--- 8< ---------------------
commit f7f2fa8eb12ea60c1190ebf3c66901830150f4fd
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Thu Jun 15 22:58:49 2017 -0700

    i965/miptree: Rework aux enabling
    
    This commit replaces the complex and confusing set of disable flags with
    two fairly straightforward fields which describe the intended auxiliary
    surface usage and whether or not the miptree supports fast clears.
    Right now, supports_fast_clear can be entirely derived from aux_usage
    but that will not always be the case.
    
    This commit makes functional changes.  One of these changes is that it
    re-enables multisampled fast-clears which were accidentally disabled in
    cec30a666930ddb8476a9452a89364a24979ff62 around a year ago.  Fixing this
    improves the SynMark v7 DeferredAA test by around ~3% on some gen9
    hardware.  This commit also gets us closer to enabling CCS_E for
    window-system buffers which are Y-tiled.
    
    Reviewed-by: Chad Versace <chadversary@chromium.org>
    Reviewed-by: Topi Pohjolainen <topi.pohjolainen@intel.com>
Comment 1 Tapani Pälli 2017-07-26 11:52:01 UTC
Jason, let me know if you want some additional information. Any tips how to debug this appreciated!
Comment 2 Jason Ekstrand 2017-07-26 19:00:18 UTC
The first thing to do would be to run with a debug build of the driver so you can get a pepper backtrace.  Without knowing how the android bits tie into the miptree code, I couldn't say for sure.  My best guess is that someone is calling miptree_create_for_bo but not calling intel_miptree_alloc_aux.
Comment 3 Tapani Pälli 2017-07-27 05:17:01 UTC
Here's a bit more symbols, it originates from a glClear call.

---- 8< ----
07-27 05:14:51.060  3240  3240 F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
07-27 05:14:51.060  3240  3240 F DEBUG   : Build fingerprint: 'AndroidIA/androidia_64/androidia_64:7.1.1/NMF26Q/tpalli07260826:eng/test-keys'
07-27 05:14:51.060  3240  3240 F DEBUG   : Revision: '0'
07-27 05:14:51.060  3240  3240 F DEBUG   : ABI: 'x86_64'
07-27 05:14:51.060  3240  3240 F DEBUG   : pid: 3072, tid: 3143, name: Chrome_InProcGp  >>> com.google.android.gms.ui <<<
07-27 05:14:51.060  3240  3240 F DEBUG   : signal 8 (SIGFPE), code 1 (FPE_INTDIV), fault addr 0x7ea19552a844
07-27 05:14:51.060  3240  3240 F DEBUG   :     rax 0000000000000000  rbx 00007ea195667dbc  rcx 0000000000000000  rdx 0000000000000000
07-27 05:14:51.060  3240  3240 F DEBUG   :     rsi 0000000000000000  rdi 0000000000000000
07-27 05:14:51.060  3240  3240 F DEBUG   :     r8  0000000000000000  r9  0000000000000000  r10 00007ea1b626fc70  r11 0000000000000246
07-27 05:14:51.060  3240  3240 F DEBUG   :     r12 00007ea191ae0ea8  r13 00007ea18d9094dc  r14 00007ea191ae0eb0  r15 00007ea191ae0eb4
07-27 05:14:51.060  3240  3240 F DEBUG   :     cs  0000000000000033  ss  000000000000002b
07-27 05:14:51.060  3240  3240 F DEBUG   :     rip 00007ea19552a844  rbp 00007ea191ae0e40  rsp 00007ea191ae0de0  eflags 0000000000010246
07-27 05:14:51.212  3240  3240 F DEBUG   : 
07-27 05:14:51.212  3240  3240 F DEBUG   : backtrace:
07-27 05:14:51.212  3240  3240 F DEBUG   :     #00 pc 0000000000725844  /system/lib64/dri/i965_dri.so (get_fast_clear_rect+404)
07-27 05:14:51.212  3240  3240 F DEBUG   :     #01 pc 00000000007255d8  /system/lib64/dri/i965_dri.so (blorp_fast_clear+328)
07-27 05:14:51.212  3240  3240 F DEBUG   :     #02 pc 00000000000615d6  /system/lib64/dri/i965_dri.so (do_single_blorp_clear+1542)
07-27 05:14:51.212  3240  3240 F DEBUG   :     #03 pc 0000000000060fae  /system/lib64/dri/i965_dri.so (brw_blorp_clear_color+206)
07-27 05:14:51.212  3240  3240 F DEBUG   :     #04 pc 000000000006671e  /system/lib64/dri/i965_dri.so (brw_clear+494)
07-27 05:14:51.212  3240  3240 F DEBUG   :     #05 pc 0000000000318a32  /system/lib64/dri/i965_dri.so (_mesa_Clear+802)
07-27 05:14:51.212  3240  3240 F DEBUG   :     #06 pc 0000000000c7a1bf  /system/app/WebViewGoogle/WebViewGoogle.apk (offset 0xa6f000)
Comment 4 Tapani Pälli 2017-07-27 06:02:32 UTC
I'm not sure if this is necessarily related but I can reproduce one failure also on desktop that bisects to this commit, Piglit tests egl-create-pbuffer-surface and egl-create-largest-pbuffer-surface start to throw this assertion:

egl-create-pbuffer-surface: intel_mipmap_tree.c:1754: intel_miptree_alloc_ccs: Assertion `mt->aux_usage == ISL_AUX_USAGE_CCS_E || mt->aux_usage == ISL_AUX_USAGE_CCS_D' failed.

before this commit test passes fine.
Comment 5 Tapani Pälli 2017-07-27 11:20:13 UTC
I did some more debugging and here is my 'braindump' on the matter. I found out that both of Piglit and Playstore failures are likely related as they both happen with pbuffer usage where we have doubleBufferMode 0 (Visual in gl_framebuffer), so maybe something that is done during flush/invalidate (in swapbuffers) is now not done, or the clear logic does not understand single buffered in some manner? I could not figure out a place where I could set aux_usage as ISL_AUX_USAGE_NONE based on the Visual setting since that gets lost somewhere but may be present when mt is initially created (?)
Comment 6 Tapani Pälli 2017-07-27 12:10:11 UTC
Created attachment 133071 [details] [review]
runtime workaround

Here's a workaround that fixes the Playstore crash, here we check visual during clear and use _mesa_meta_glsl_Clear instead of blorp.
Comment 7 Tapani Pälli 2017-07-31 11:59:04 UTC
With further debugging I've noticed that during the crash in blorp_fast_clear, aux surface format field can have complete garbage in it, sometimes 0x0 and sometimes random number, for example '0xd1a72000' (which goes way too big). When it works this field has ISL_FORMAT_GEN9_CCS_32BPP (0x38b) on Joule. So .. I'll try to understand and hunt down how this format field lives.
Comment 8 Tapani Pälli 2017-08-01 10:31:00 UTC
Created attachment 133167 [details] [review]
workaround that works!

workaround that fixes both desktop and android issues (earlier workaround caused dEQP regression)
Comment 9 Jason Ekstrand 2017-08-02 20:59:16 UTC
I think this patch should fix it:

https://patchwork.freedesktop.org/patch/170058/
Comment 10 Tapani Pälli 2017-08-03 07:44:30 UTC
Created attachment 133222 [details] [review]
hopeful fix

Jason, what do you think of this one? This fixes the Piglit issue. I will test this also on Android.
Comment 11 Tapani Pälli 2017-08-03 07:48:54 UTC
yep, patch in comment #10 fixes issues seen both on android and desktop
Comment 12 Jason Ekstrand 2017-08-03 12:44:03 UTC
(In reply to Tapani Pälli from comment #10)
> Created attachment 133222 [details] [review] [review]
> hopeful fix
> 
> Jason, what do you think of this one? This fixes the Piglit issue. I will
> test this also on Android.

Ok, that's really weird.  That should already be handled by the mt->supports_fasrfclear check.  How is mt->aux_usage == NONE but mt->supports_fast_clear == true?  The only place can_fast_clear is set is in choose_auz_usage?  A watchpoint in Feb should be able to answer that question.
Comment 13 Tapani Pälli 2017-08-03 14:55:54 UTC
IIRC aux buffer creation can succeed first but then this 'make shareable' func later marks mt with aux usage NONE. I can try to verify this.
Comment 14 Jason Ekstrand 2017-08-03 16:54:30 UTC
(In reply to Tapani Pälli from comment #13)
> IIRC aux buffer creation can succeed first but then this 'make shareable'
> func later marks mt with aux usage NONE. I can try to verify this.

I was able to reproduce on Haswell with X11.  Yes, it was make_shareable that was the culpret though I still think the DRI2 path is a bit sketchy.  I've got a new patch on the list which will hopefully also fix the android issue.
Comment 15 Tapani Pälli 2017-08-08 08:13:19 UTC
commit e7a52cc381d275b4ab8ee2fb230e32ab97090daf
Author: Jason Ekstrand <jason.ekstrand@intel.com>
Date:   Thu Aug 3 09:48:55 2017 -0700

    i965/miptree: Set supports_fast_clear = false in make_shareable
    
    The make_shareable function deletes the aux buffer and then whacks
    aux_usage to ISL_AUX_USAGE_NONE but not unsetting supports_fast_clear.
    Since we only look at supports_fast_clear to decide whether or not to do
    fast clears, this was causing assertion failures.
    
    Reported-by: Tapani Pälli <tapani.palli@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=101925
    Reviewed-by: Kenneth Graunke <kenneth@whitecape.org>
    Reviewed-by: Tapani Pälli <tapani.palli@intel.com>


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.