Created attachment 140804 [details]
DMESG LOG

AnTuTu is a popular benchmarking app for Android smartphones and tablets. Recently, when launching AnTuTu 7.x on Android IA and Chrome OS, we encountered a GPU hang:

[ 348.713449] [drm] GPU HANG: ecode 9:0:0x84df9ffc, in Thread-10 [8813], reason: hang on rcs0, action: reset
[ 348.713453] [drm:drm_ioctl] pid=3367, dev=0xe200, auth=1, DRM_IOCTL_WAIT_VBLANK
[ 348.713455] [drm:drm_wait_vblank_ioctl] waiting on vblank count 20853, crtc 0
[ 348.713485] [drm:i915_reset_device] resetting chip
[ 348.713500] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 348.713825] [drm:i915_gem_reset_engine] context Thread-10[8813]/1 marked guilty (score 10) banned? no
[ 348.713827] [drm:i915_gem_reset_engine] resetting rcs0 to restart from tail of request 0x413a
[ 348.713856] [drm] RC6 on
[ 348.714048] [drm:gen8_init_common_ring] Execlists enabled for rcs0
[ 348.714055] [drm:init_workarounds_ring] rcs0: Number of context specific w/a: 15
Created attachment 140805 [details]
/sys/class/drm/card0/error

Uploaded /sys/class/drm/card0/error.

This issue could not be reproduced with Mesa 17.2 on Chrome OS. Since 17.3, we see the GPU hang and the app crashes. Some adb logcat output that may be helpful:

07-24 03:31:35.914 9130 9130 F DEBUG : #08 pc 00000000000a262a /vendor/lib64/dri/i965_dri.so (_intel_batchbuffer_flush_fence+2522)
07-24 03:31:35.914 9130 9130 F DEBUG : #09 pc 0000000000082ed6 /vendor/lib64/dri/i965_dri.so (brw_draw_prims+1462)
07-24 03:31:35.914 9130 9130 F DEBUG : #10 pc 000000000057bbb8 /vendor/lib64/dri/i965_dri.so (vbo_exec_DrawElements+440)
I'm able to reproduce this and have tried to debug it for some time. My Android stack is a bit old (4.14.35 kernel), but I've rebased the Mesa patches on top of a recent Mesa commit (b3b170a) and tested; the hang still occurs. I also tried with the "intel/ppgtt: memory address alignment" fix, but that did not make any difference.
I noticed that some shaders fail to load (?), which indicates a possible issue within the application itself:

--- 8< ---
07-26 06:40:09.773 9704 9748 D RavnStudio: process_shader[0x8b31]: /sdcard/androiddata/shaders\Textshader.hlsl_main_vs_vp
07-26 06:40:09.781 9704 9748 D RavnStudio: process_shader[0x8b30]: /sdcard/androiddata/shaders\Textshader.hlsl_main_ps_fp
07-26 06:40:09.781 9704 9748 D RavnStudio: process_shader[0x8b31]: /sdcard/androiddata/shaders\SpriteShader_vs.hlsl_main_vp
07-26 06:40:09.781 9704 9748 D RavnStudio: File not found /sdcard/androiddata/shaders/SpriteShader_vs.hlsl_main_vp
07-26 06:40:09.781 9704 9748 D RavnStudio: **shader /sdcard/androiddata/shaders\SpriteShader_vs.hlsl_main_vp failed to load!
07-26 06:40:09.781 9704 9748 D RavnStudio: process_shader[0x8b30]: /sdcard/androiddata/shaders\SpriteShader_ps.hlsl_main_fp
07-26 06:40:09.781 9704 9748 D RavnStudio: File not found /sdcard/androiddata/shaders/SpriteShader_ps.hlsl_main_fp
07-26 06:40:09.781 9704 9748 D RavnStudio: **shader /sdcard/androiddata/shaders\SpriteShader_ps.hlsl_main_fp failed to load!
07-26 06:40:09.781 9704 9748 D RavnStudio: Linking incomplete shaders
07-26 06:40:09.781 9704 9748 D RavnStudio: process_shader[0x8b31]: /sdcard/androiddata/shaders\ScreenQuadShader_vs.hlsl_main_vp
07-26 06:40:09.781 9704 9748 D RavnStudio: File not found /sdcard/androiddata/shaders/ScreenQuadShader_vs.hlsl_main_vp
07-26 06:40:09.781 9704 9748 D RavnStudio: **shader /sdcard/androiddata/shaders\ScreenQuadShader_vs.hlsl_main_vp failed to load!
07-26 06:40:09.781 9704 9748 D RavnStudio: process_shader[0x8b30]: /sdcard/androiddata/shaders\ScreenQuadShader_ps.hlsl_main_fp
07-26 06:40:09.781 9704 9748 D RavnStudio: File not found /sdcard/androiddata/shaders/ScreenQuadShader_ps.hlsl_main_fp
07-26 06:40:09.781 9704 9748 D RavnStudio: **shader /sdcard/androiddata/shaders\ScreenQuadShader_ps.hlsl_main_fp failed to load!
07-26 06:40:09.781 9704 9748 D RavnStudio: Linking incomplete shaders
--- 8< ---
Hi Tapani, thanks for the update. Yes, but it may not be the root cause: AnTuTu 7.x launches fine on Chrome OS with Mesa 17.2, but crashes with 17.3 and later.
(In reply to Ren Chenglei from comment #4)
> Hi Tapani, thanks for the update. Yes, but it may not be the root cause:
> AnTuTu 7.x launches fine on Chrome OS with Mesa 17.2, but crashes with 17.3
> and later.

Maybe it isn't the root cause, but I don't feel comfortable debugging an app that fails to load shaders or other assets; it means the app is not working the way it was designed. I also saw the following files fail to load:

--- 8< ---
File not found /sdcard/androiddata/level/level/chunks_astc.cnk
File not found /sdcard/androiddata/level/level/chunks.cnk
--- 8< ---
(In reply to Tapani Pälli from comment #5)
> (In reply to Ren Chenglei from comment #4)
> > Hi Tapani, thanks for the update. Yes, but it may not be the root cause:
> > AnTuTu 7.x launches fine on Chrome OS with Mesa 17.2, but crashes with
> > 17.3 and later.
> 
> Maybe it isn't the root cause, but I don't feel comfortable debugging an app
> that fails to load shaders or other assets; it means the app is not working
> the way it was designed. I also saw the following files fail to load:
> 
> --- 8< ---
> File not found /sdcard/androiddata/level/level/chunks_astc.cnk
> File not found /sdcard/androiddata/level/level/chunks.cnk
> --- 8< ---

+ File not found /sdcard/androiddata/level/level/chunkpaths.bin
The problem seems to be that we fail to submit a batchbuffer:

i965: Failed to submit batchbuffer: I/O error

This is considered critical and the driver exits. By ignoring these failures and just printing errors to the log, I got 'Coastline' and the rest of the benchmark running.
(In reply to Tapani Pälli from comment #7)
> The problem seems to be that we fail to submit a batchbuffer:
> 
> i965: Failed to submit batchbuffer: I/O error
> 
> This is considered critical and the driver exits. By ignoring these failures
> and just printing errors to the log, I got 'Coastline' and the rest of the
> benchmark running.

Having said that, the I/O error probably happens because of the hang (the context gets banned), so I'm just working around the actual issue, which is the GPU hang.
Based on investigation of the error state, it seems we may end up violating the following rule from the PRM:

"The 3DSTATE_CONSTANT_VS must be reprogrammed prior to the next 3DPRIMITIVE command after programming the 3DSTATE_PUSH_CONSTANT_ALLOC_VS"

In some cases it seems we emit the ALLOC and then a 3DPRIMITIVE without programming the constant state first.
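To make the suspected violation easier to see, here is a small, self-contained sketch (not Mesa or kernel code; the command enum and checker are purely illustrative) that models the rule and flags a 3DPRIMITIVE that follows a 3DSTATE_PUSH_CONSTANT_ALLOC_VS without an intervening 3DSTATE_CONSTANT_VS:

#include <stdbool.h>
#include <stdio.h>

/* Illustrative only -- these enums and functions are not Mesa/i915 code. */
enum cmd {
   CMD_3DSTATE_PUSH_CONSTANT_ALLOC_VS,
   CMD_3DSTATE_CONSTANT_VS,
   CMD_3DPRIMITIVE,
};

/* Scan a command stream and report whether it breaks the PRM rule quoted
 * above: after 3DSTATE_PUSH_CONSTANT_ALLOC_VS, the next 3DPRIMITIVE must be
 * preceded by a fresh 3DSTATE_CONSTANT_VS. */
static bool
violates_push_constant_rule(const enum cmd *stream, int len)
{
   bool constants_stale = false;

   for (int i = 0; i < len; i++) {
      switch (stream[i]) {
      case CMD_3DSTATE_PUSH_CONSTANT_ALLOC_VS:
         constants_stale = true;   /* the allocation invalidates constant state */
         break;
      case CMD_3DSTATE_CONSTANT_VS:
         constants_stale = false;  /* constants reprogrammed, drawing is safe again */
         break;
      case CMD_3DPRIMITIVE:
         if (constants_stale)
            return true;           /* draw issued with stale constants */
         break;
      }
   }
   return false;
}

int
main(void)
{
   /* The suspected bad pattern: ALLOC followed directly by a draw. */
   const enum cmd bad[] = {
      CMD_3DSTATE_PUSH_CONSTANT_ALLOC_VS,
      CMD_3DPRIMITIVE,
   };
   /* The ordering the PRM requires. */
   const enum cmd good[] = {
      CMD_3DSTATE_PUSH_CONSTANT_ALLOC_VS,
      CMD_3DSTATE_CONSTANT_VS,
      CMD_3DPRIMITIVE,
   };

   printf("bad stream violates rule:  %d\n", violates_push_constant_rule(bad, 2));
   printf("good stream violates rule: %d\n", violates_push_constant_rule(good, 3));
   return 0;
}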
I don't see how that can happen; we run the atom which emits 3DSTATE_CONSTANT_* on every draw call, based on the push_constants_dirty bits, which are set to true every time we emit 3DSTATE_PUSH_CONSTANT_ALLOC_VS. The code even cites that exact rule.
(In reply to Kenneth Graunke from comment #10)
> I don't see how that can happen; we run the atom which emits
> 3DSTATE_CONSTANT_* on every draw call, based on the push_constants_dirty
> bits, which are set to true every time we emit
> 3DSTATE_PUSH_CONSTANT_ALLOC_VS. The code even cites that exact rule.

Maybe this is some kind of interaction with BLORP. Maybe adding BRW_NEW_BLORP to the brw_tracked_state of gen7_push_constant_space? In Anv we re-emit the push constant allocation after a set of BLORP commands.
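For context, here is a rough, runnable model of the atom / dirty-bit mechanism being discussed (the struct, flag, and function names are simplified stand-ins, not Mesa's actual brw_tracked_state code): each atom lists the dirty bits it listens to, emitting the push constant allocation marks the constants dirty, and adding a BLORP bit to the allocation atom, which is what the suggestion above amounts to, would make it re-emit after a BLORP pass:

#include <stdint.h>
#include <stdio.h>

/* Simplified stand-ins for i965-style dirty flags (names are illustrative). */
#define FLAG_NEW_CONTEXT              (1u << 0)
#define FLAG_NEW_PUSH_CONSTANT_ALLOC  (1u << 1)
#define FLAG_NEW_BLORP                (1u << 2)

static uint32_t extra_dirty;  /* analogue of push_constants_dirty */

static void
emit_push_constant_alloc(void)
{
   puts("emit 3DSTATE_PUSH_CONSTANT_ALLOC_*");
   /* Emitting the allocation forces the constants to be reprogrammed
    * before the next draw, per the PRM rule quoted above. */
   extra_dirty |= FLAG_NEW_PUSH_CONSTANT_ALLOC;
}

static void
emit_constants(void)
{
   puts("emit 3DSTATE_CONSTANT_*");
}

struct tracked_state {
   const char *name;
   uint32_t    listen;       /* dirty bits this atom reacts to */
   void      (*emit)(void);  /* re-emits the corresponding GPU commands */
};

/* Adding FLAG_NEW_BLORP to the allocation atom is the idea behind the
 * proposed fix: make it re-emit after a BLORP operation. */
static const struct tracked_state atoms[] = {
   { "push_constant_space", FLAG_NEW_CONTEXT | FLAG_NEW_BLORP,
     emit_push_constant_alloc },
   { "constant_state",      FLAG_NEW_PUSH_CONSTANT_ALLOC,
     emit_constants },
};

static void
upload_state(uint32_t dirty)
{
   for (unsigned i = 0; i < sizeof(atoms) / sizeof(atoms[0]); i++) {
      if (atoms[i].listen & (dirty | extra_dirty))
         atoms[i].emit();
   }
   extra_dirty = 0;
}

int
main(void)
{
   upload_state(FLAG_NEW_CONTEXT);  /* first draw: allocate, then constants */
   upload_state(FLAG_NEW_BLORP);    /* draw after BLORP: allocation re-emitted */
   return 0;
}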
Created attachment 140865 [details] [review]
i965: reemit push constant urb allocation after blorp

Let me know if that makes sense/works.
(In reply to Lionel Landwerlin from comment #12)
> Created attachment 140865 [details] [review]
> i965: reemit push constant urb allocation after blorp
> 
> Let me know if that makes sense/works.

I have tried the patch, but it still doesn't work; the app crashes again.
(In reply to Ren Chenglei from comment #13)
> (In reply to Lionel Landwerlin from comment #12)
> > Created attachment 140865 [details] [review]
> > i965: reemit push constant urb allocation after blorp
> > 
> > Let me know if that makes sense/works.
> 
> I have tried the patch, but it still doesn't work; the app crashes again.

Thanks. Could you upload the error state from a run with that patch applied?
Created attachment 140896 [details]
Dmesg log with patch 140865
(In reply to Ren Chenglei from comment #15)
> Created attachment 140896 [details]
> Dmesg log with patch 140865

Sorry, that's dmesg, not the error state (/sys/class/drm/card0/error).
Created attachment 140897 [details]
Logcat with patch 140865
(In reply to Lionel Landwerlin from comment #16)
> (In reply to Ren Chenglei from comment #15)
> > Created attachment 140896 [details]
> > Dmesg log with patch 140865
> 
> Sorry, that's dmesg, not the error state (/sys/class/drm/card0/error).

Sorry about that. Could I upload it tomorrow? I have already left, and the device has been shut down.
Sure
Created attachment 140904 [details]
Error state with patch 140865
Do we have any progress on this crash issue?
(In reply to Ren Chenglei from comment #21)
> Do we have any progress on this crash issue?

I haven't been able to figure out the hang. Some observations, though: I've
disabled EXT_tessellation_shader in the driver; this causes AnTuTu to skip
the first test and run the rest of the tests, and there is no hang. I've
also tried disabling the use of push constants (the hang seems to be
related to these), but this did not help; the hang still occurs. The next
thing I'm going to try is to dump all the shaders from the first test and
see if I can figure out which ones might be related to the hang.
(In reply to Tapani Pälli from comment #22)
> I haven't been able to figure out the hang. Some observations, though: I've
> disabled EXT_tessellation_shader in the driver; this causes AnTuTu to skip
> the first test and run the rest of the tests, and there is no hang. I've
> also tried disabling the use of push constants (the hang seems to be
> related to these), but this did not help; the hang still occurs. The next
> thing I'm going to try is to dump all the shaders from the first test and
> see if I can figure out which ones might be related to the hang.

Thanks Tapani for the updates! I tried the following:

diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
index df999ffeb1d0..325f28b35e60 100644
--- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
+++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
@@ -823,7 +823,7 @@ submit_batch(struct brw_context *brw, int in_fence_fd, int *out_fence_fd)
    if (ret != 0) {
       fprintf(stderr, "i965: Failed to submit batchbuffer: %s\n",
               strerror(-ret));
-      exit(1);
+      //exit(1);
    }
 
    return ret;

If we don't exit after the GPU hang, the rest of the tests continue to run, but the screen in the first test is still black.
(In reply to Tapani Pälli from comment #22)
> I haven't been able to figure out the hang. Some observations, though: I've
> disabled EXT_tessellation_shader in the driver; this causes AnTuTu to skip
> the first test and run the rest of the tests, and there is no hang. I've
> also tried disabling the use of push constants (the hang seems to be
> related to these), but this did not help; the hang still occurs. The next
> thing I'm going to try is to dump all the shaders from the first test and
> see if I can figure out which ones might be related to the hang.

I tried on a Pixel 2, and the first test works fine there. However, we still see the shader loading failures on the Pixel 2 as well; please refer to attachment pixel2_adblog.txt.
Created attachment 141058 [details]
Logcat with Pixel2
Maybe I caught the root cause of the GPU hang issue:

diff --git a/src/mesa/vbo/vbo_exec_array.c b/src/mesa/vbo/vbo_exec_array.c
index 58bba208db10..fa871473ca14 100644
--- a/src/mesa/vbo/vbo_exec_array.c
+++ b/src/mesa/vbo/vbo_exec_array.c
@@ -1001,19 +1003,24 @@ vbo_exec_DrawElements(GLenum mode, GLsizei count, GLenum type,
    FLUSH_FOR_DRAW(ctx);
 
    if (_mesa_is_no_error_enabled(ctx)) {
-      _mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));
+      //The following function will cause screen black
+      //_mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));
 
       if (ctx->NewState)
          _mesa_update_state(ctx);
    } else {
-      _mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));
+      //The following function will cause screen black
+      //_mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));
 
       if (!_mesa_validate_DrawElements(ctx, mode, count, type, indices))
          return;
    }
 
-   vbo_validated_drawrangeelements(ctx, mode, GL_FALSE, 0, ~0,
+   if (mode != GL_PATCHES)
+      vbo_validated_drawrangeelements(ctx, mode, GL_FALSE, 0, ~0,
                                    count, type, indices, 0, 1, 0);
+   else
+      ALOGD("mesa - log - skip draw as GL_PATCHES cause GPU hang");
 }

Tapani, do we support GL_PATCHES in Mesa? It is only available with GLES
3.2 or greater:
https://www.khronos.org/registry/OpenGL-Refpages/es3/

When the mode is GL_PATCHES and we skip the draw, the GPU hang can't be
reproduced.

BTW, the screen in the first test is black. Commenting out the calls to
_mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));
makes it better, but there is still some issue.
(In reply to Ren Chenglei from comment #26)
> Tapani, do we support GL_PATCHES in Mesa? It is only available with GLES
> 3.2 or greater:
> https://www.khronos.org/registry/OpenGL-Refpages/es3/

Yes, we do support GL_PATCHES, which is used with tessellation. This is a good hint, though; we could perhaps trace the actual frame that causes the hang. Unfortunately we currently don't have a way to take GL traces on Android :/ But we could dump the shaders at the point when such a draw call arrives.

> When the mode is GL_PATCHES and we skip the draw, the GPU hang can't be
> reproduced.
> 
> BTW, the screen in the first test is black. Commenting out the calls to
> _mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));
> makes it better, but there is still some issue.

Yep, we can't really skip those calls; they are necessary so that we have the correct VAO in place.
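For reference, this is roughly what a tessellated draw looks like from the application side in GLES 3.2. A sketch only: it assumes a current GLES 3.2 context, a program with tessellation control/evaluation stages bound, and a VAO plus element buffer already set up; the function name and patch size are placeholders, not benchmark code.

#include <GLES3/gl32.h>

/* Hypothetical helper, for illustration: issues the kind of GL_PATCHES
 * draw discussed above. */
static void
draw_tessellated_patches(GLsizei index_count)
{
   /* Number of control points per patch; 3 is just an example value,
    * the real value is application-dependent. */
   glPatchParameteri(GL_PATCH_VERTICES, 3);

   /* With TCS/TES stages active the primitive mode must be GL_PATCHES;
    * this is the draw that was skipped in the experiment above to avoid
    * the hang. */
   glDrawElements(GL_PATCHES, index_count, GL_UNSIGNED_SHORT, NULL);
}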
Hi Tapani,

Are you still working on this? We can still reproduce this issue with Mesa
19.0.6 on an Android 4.19 kernel; the latest hang error state is attached.
Created attachment 144526 [details]
GPU HANG error state on Mesa 19.0.6 with kernel 4.19
(In reply to yugang from comment #28)
> Hi Tapani,
> 
> Are you still working on this? We can still reproduce this issue with Mesa
> 19.0.6 on an Android 4.19 kernel; the latest hang error state is attached.

I'm having issues running this now; I'm getting "Download failed because
the resources could not be found" :/ I will need to figure out how to make
it work.
(In reply to Tapani Pälli from comment #30)
> (In reply to yugang from comment #28)
> > Hi Tapani,
> > 
> > Are you still working on this? We can still reproduce this issue with
> > Mesa 19.0.6 on an Android 4.19 kernel; the latest hang error state is
> > attached.
> 
> I'm having issues running this now; I'm getting "Download failed because
> the resources could not be found" :/ I will need to figure out how to make
> it work.

OK, I got it running now, and yes, it is still reproducible (it happens with the Iris driver as well).
(In reply to Tapani Pälli from comment #22)
> I haven't been able to figure out the hang. Some observations, though: I've
> disabled EXT_tessellation_shader in the driver; this causes AnTuTu to skip
> the first test and run the rest of the tests, and there is no hang. I've
> also tried disabling the use of push constants (the hang seems to be
> related to these), but this did not help; the hang still occurs. The next
> thing I'm going to try is to dump all the shaders from the first test and
> see if I can figure out which ones might be related to the hang.

I tried commenting out the loading of the shader files below; with that, the hang is not reproduced, so the hang seems to be related to these files:

/sdcard/androiddata/shaders\floop_tesselation.glsl_shadow_vs_vp
/sdcard/androiddata/shaders\floop_tesselation.glsl_shadow_tes_tes
/sdcard/androiddata/shaders\floop_tesselation.glsl_shadow_tcs_tcs

I am trying to dump those files to see if there are any related findings.
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1741.