Bug 107356 - [kbl/apl] [drm] GPU hang in Antutu 7.x games with 4.14.52 kernel
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: 19.0
Hardware: x86-64 (AMD64) other
Importance: medium normal
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
Reported: 2018-07-24 08:11 UTC by Ren Chenglei
Modified: 2019-09-25 19:12 UTC

Attachments
DMESG LOG (11.09 MB, text/plain)
2018-07-24 08:11 UTC, Ren Chenglei
/sys/class/drm/card0/error (119.99 KB, text/plain)
2018-07-24 08:15 UTC, Ren Chenglei
i965: reemit push constant urb allocation after blorp (1.03 KB, patch)
2018-07-28 12:04 UTC, Lionel Landwerlin
Dmesg log with patch 140865 (11.59 MB, text/plain)
2018-07-30 14:48 UTC, Ren Chenglei
Logcat with patch 140865 (633.78 KB, text/plain)
2018-07-30 14:51 UTC, Ren Chenglei
error state with patch 140865 (131.99 KB, text/plain)
2018-07-31 01:52 UTC, Ren Chenglei
Logcat with Pixel2 (925.33 KB, text/plain)
2018-08-13 09:17 UTC, Ren Chenglei
GPU HANG error state on Mesa 19.0.6 with kernel 4.19 (1.79 MB, text/plain)
2019-06-13 07:16 UTC, yugang

Description Ren Chenglei 2018-07-24 08:11:20 UTC
Created attachment 140804 [details]
DMESG LOG

AnTuTu is a popular benchmark app for Android smartphones and tablets. Recently, when we launched AnTuTu 7.x on Android IA and Chrome OS, we encountered a GPU hang:
[  348.713449] [drm] GPU HANG: ecode 9:0:0x84df9ffc, in Thread-10 [8813], reason: hang on rcs0, action: reset
[  348.713453] [drm:drm_ioctl] pid=3367, dev=0xe200, auth=1, DRM_IOCTL_WAIT_VBLANK
[  348.713455] [drm:drm_wait_vblank_ioctl] waiting on vblank count 20853, crtc 0
[  348.713485] [drm:i915_reset_device] resetting chip
[  348.713500] i915 0000:00:02.0: Resetting chip for hang on rcs0
[  348.713825] [drm:i915_gem_reset_engine] context Thread-10[8813]/1 marked guilty (score 10) banned? no
[  348.713827] [drm:i915_gem_reset_engine] resetting rcs0 to restart from tail of request 0x413a
[  348.713856] [drm] RC6 on
[  348.714048] [drm:gen8_init_common_ring] Execlists enabled for rcs0
[  348.714055] [drm:init_workarounds_ring] rcs0: Number of context specific w/a: 15
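For triage, the fields of the GPU HANG line in the log above can be pulled out programmatically. The following is a minimal C sketch, assuming the line layout shown in this log (other kernel versions may print it differently). The names struct hang_info and parse_gpu_hang are hypothetical, and the three ecode numbers are kept as opaque values rather than interpreted.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the fields of one "GPU HANG" dmesg line. */
struct hang_info {
    unsigned int ecode[3]; /* the three colon-separated ecode values, uninterpreted */
    char comm[32];         /* process/thread name (assumed to contain no spaces) */
    int pid;
    char engine[16];       /* e.g. "rcs0" */
};

/*
 * Parses a line such as:
 *   [drm] GPU HANG: ecode 9:0:0x84df9ffc, in Thread-10 [8813], reason: hang on rcs0, action: reset
 * Returns 0 on success, -1 if the line does not match the assumed layout.
 */
static int parse_gpu_hang(const char *line, struct hang_info *out)
{
    const char *p = strstr(line, "GPU HANG: ecode ");
    if (!p)
        return -1;

    /* %x accepts the optional 0x prefix; %15[^,] stops at the next comma. */
    int n = sscanf(p,
                   "GPU HANG: ecode %u:%u:%x, in %31s [%d], reason: hang on %15[^,]",
                   &out->ecode[0], &out->ecode[1], &out->ecode[2],
                   out->comm, &out->pid, out->engine);
    return n == 6 ? 0 : -1;
}
```

Feeding each dmesg line to parse_gpu_hang and collecting the pid and engine makes it easier to match a hang against the crashing process in logcat.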
Comment 1 Ren Chenglei 2018-07-24 08:15:27 UTC
Created attachment 140805 [details]
/sys/class/drm/card0/error

Uploaded /sys/class/drm/card0/error.
This issue could not be reproduced with Mesa 17.2 on Chrome OS. Since 17.3, we see the GPU hang and the app crash.
Part of the adb log may be helpful:
07-24 03:31:35.914  9130  9130 F DEBUG   :     #08 pc 00000000000a262a  /vendor/lib64/dri/i965_dri.so (_intel_batchbuffer_flush_fence+2522)
07-24 03:31:35.914  9130  9130 F DEBUG   :     #09 pc 0000000000082ed6  /vendor/lib64/dri/i965_dri.so (brw_draw_prims+1462)
07-24 03:31:35.914  9130  9130 F DEBUG   :     #10 pc 000000000057bbb8  /vendor/lib64/dri/i965_dri.so (vbo_exec_DrawElements+440)
Comment 2 Tapani Pälli 2018-07-26 06:28:38 UTC
I'm able to reproduce this and have tried to debug it for some time. My Android stack is a bit old (4.14.35 kernel) but I've rebased Mesa patches on top of recent Mesa commit b3b170a and tested, hang still occurs. Also tried with "intel/ppgtt: memory address alignment" fix but that did not make any difference.
Comment 3 Tapani Pälli 2018-07-26 06:45:47 UTC
I noticed that some shaders fail to load (?) which indicates possible issue within the application itself:

--- 8< ---
07-26 06:40:09.773  9704  9748 D RavnStudio: process_shader[0x8b31]: /sdcard/androiddata/shaders\Textshader.hlsl_main_vs_vp
07-26 06:40:09.781  9704  9748 D RavnStudio: process_shader[0x8b30]: /sdcard/androiddata/shaders\Textshader.hlsl_main_ps_fp
07-26 06:40:09.781  9704  9748 D RavnStudio: process_shader[0x8b31]: /sdcard/androiddata/shaders\SpriteShader_vs.hlsl_main_vp
07-26 06:40:09.781  9704  9748 D RavnStudio: File not found /sdcard/androiddata/shaders/SpriteShader_vs.hlsl_main_vp
07-26 06:40:09.781  9704  9748 D RavnStudio: **shader /sdcard/androiddata/shaders\SpriteShader_vs.hlsl_main_vp failed to load!
07-26 06:40:09.781  9704  9748 D RavnStudio: process_shader[0x8b30]: /sdcard/androiddata/shaders\SpriteShader_ps.hlsl_main_fp
07-26 06:40:09.781  9704  9748 D RavnStudio: File not found /sdcard/androiddata/shaders/SpriteShader_ps.hlsl_main_fp
07-26 06:40:09.781  9704  9748 D RavnStudio: **shader /sdcard/androiddata/shaders\SpriteShader_ps.hlsl_main_fp failed to load!
07-26 06:40:09.781  9704  9748 D RavnStudio: Linking incomplete shaders
07-26 06:40:09.781  9704  9748 D RavnStudio: process_shader[0x8b31]: /sdcard/androiddata/shaders\ScreenQuadShader_vs.hlsl_main_vp
07-26 06:40:09.781  9704  9748 D RavnStudio: File not found /sdcard/androiddata/shaders/ScreenQuadShader_vs.hlsl_main_vp
07-26 06:40:09.781  9704  9748 D RavnStudio: **shader /sdcard/androiddata/shaders\ScreenQuadShader_vs.hlsl_main_vp failed to load!
07-26 06:40:09.781  9704  9748 D RavnStudio: process_shader[0x8b30]: /sdcard/androiddata/shaders\ScreenQuadShader_ps.hlsl_main_fp
07-26 06:40:09.781  9704  9748 D RavnStudio: File not found /sdcard/androiddata/shaders/ScreenQuadShader_ps.hlsl_main_fp
07-26 06:40:09.781  9704  9748 D RavnStudio: **shader /sdcard/androiddata/shaders\ScreenQuadShader_ps.hlsl_main_fp failed to load!
07-26 06:40:09.781  9704  9748 D RavnStudio: Linking incomplete shaders
--- 8< ---
Comment 4 Ren Chenglei 2018-07-26 06:53:39 UTC
Hi Tapani, thanks for the update. Yes, but it may not be the root cause, as Antutu 7.x can launch on Chrome OS with Mesa 17.2 but crashes with 17.3 and later.
Comment 5 Tapani Pälli 2018-07-26 07:10:09 UTC
(In reply to Ren Chenglei from comment #4)
> Hi Tapani, thanks for the update. Yes, but it may not be the root cause, as
> Antutu 7.x can launch on Chrome OS with Mesa 17.2 but crashes with 17.3 and
> later.

Maybe it is not the root cause, but I'm not comfortable debugging an app that fails to load shaders or other assets; it means the app is not working the way it was designed. I also saw the following files fail to load:

--- 8< ---
File not found /sdcard/androiddata/level/level/chunks_astc.cnk
File not found /sdcard/androiddata/level/level/chunks.cnk
--- 8< ---
Comment 6 Tapani Pälli 2018-07-26 07:11:49 UTC
(In reply to Tapani Pälli from comment #5)
> (In reply to Ren Chenglei from comment #4)
> > Hi Tapani, thanks for the update. Yes, but it may not be the root cause, as
> > Antutu 7.x can launch on Chrome OS with Mesa 17.2 but crashes with 17.3 and
> > later.
> 
> Maybe it is not the root cause, but I'm not comfortable debugging an app that
> fails to load shaders or other assets; it means the app is not working the
> way it was designed. I also saw the following files fail to load:
> 
> --- 8< ---
> File not found /sdcard/androiddata/level/level/chunks_astc.cnk
> File not found /sdcard/androiddata/level/level/chunks.cnk
> --- 8< ---

+ File not found /sdcard/androiddata/level/level/chunkpaths.bin
Comment 7 Tapani Pälli 2018-07-26 10:32:18 UTC
Problem seems to be that we fail to submit batchbuffer:

i965: Failed to submit batchbuffer: I/O error

This is considered critical and driver exits. By ignoring these failures and just printing errors to the log, I got 'Coastline' and rest of the benchmark running.
Comment 8 Tapani Pälli 2018-07-26 10:43:07 UTC
(In reply to Tapani Pälli from comment #7)
> Problem seems to be that we fail to submit batchbuffer:
> 
> i965: Failed to submit batchbuffer: I/O error
> 
> This is considered critical and driver exits. By ignoring these failures and
> just printing errors to the log, I got 'Coastline' and rest of the benchmark
> running.

Having said that, the I/O error probably happens because of the hang (context gets banned) so I'm just working around the actual issue which is the gpu hang.
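The trade-off described in comments 7 and 8 (exit on any submit failure versus log and carry on) can be sketched as a small policy function. This is only an illustration under stated assumptions: classify_submit_result, enum submit_action, and the tolerate_eio switch are hypothetical and not part of i965. The sketch assumes the negative-errno return convention that the driver's error message implies.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical outcomes for a batch submission attempt. */
enum submit_action {
    SUBMIT_OK,
    SUBMIT_LOG_AND_CONTINUE,
    SUBMIT_FATAL
};

/*
 * ret is 0 on success or -errno on failure.
 * tolerate_eio is a hypothetical debug switch: when set, an I/O error
 * (the symptom seen after the hang) is logged but not treated as fatal.
 */
static enum submit_action classify_submit_result(int ret, int tolerate_eio)
{
    if (ret == 0)
        return SUBMIT_OK;

    fprintf(stderr, "Failed to submit batchbuffer: %s\n", strerror(-ret));

    if (ret == -EIO && tolerate_eio)
        return SUBMIT_LOG_AND_CONTINUE;

    return SUBMIT_FATAL;
}
```

As comment 8 points out, tolerating the error only hides the symptom; the GPU hang that produces the I/O error is still there.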
Comment 9 Tapani Pälli 2018-07-27 08:59:13 UTC
Based on investigation of the error state, it seems that we may end up violating the following rule from the PRM:

"The 3DSTATE_CONSTANT_VS must be reprogrammed prior to the next 3DPRIMITIVE
command after programming the 3DSTATE_PUSH_CONSTANT_ALLOC_VS"

In some cases it seems we do ALLOC and then 3DPRIMITIVE without programming constant state first.
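The ordering rule quoted above can be checked mechanically once a batch has been decoded into a list of commands. Below is a minimal C sketch of such a check, assuming a hypothetical, heavily simplified command enum; real batches contain many more packet types and would first need decoding from the error state.

```c
#include <stddef.h>

/* Simplified, hypothetical command tags for a decoded batch. */
enum cmd {
    CMD_PUSH_CONSTANT_ALLOC_VS, /* 3DSTATE_PUSH_CONSTANT_ALLOC_VS */
    CMD_CONSTANT_VS,            /* 3DSTATE_CONSTANT_VS */
    CMD_3DPRIMITIVE,            /* 3DPRIMITIVE */
    CMD_OTHER                   /* anything else */
};

/*
 * Returns the index of the first 3DPRIMITIVE that follows a
 * 3DSTATE_PUSH_CONSTANT_ALLOC_VS without an intervening
 * 3DSTATE_CONSTANT_VS, or -1 if the sequence obeys the rule.
 */
static long find_rule_violation(const enum cmd *cmds, size_t count)
{
    int need_constants = 0; /* set by ALLOC, cleared by CONSTANT */

    for (size_t i = 0; i < count; i++) {
        switch (cmds[i]) {
        case CMD_PUSH_CONSTANT_ALLOC_VS:
            need_constants = 1;
            break;
        case CMD_CONSTANT_VS:
            need_constants = 0;
            break;
        case CMD_3DPRIMITIVE:
            if (need_constants)
                return (long)i;
            break;
        default:
            break;
        }
    }
    return -1;
}
```

A sequence ALLOC, OTHER, 3DPRIMITIVE is flagged at the draw, while ALLOC, CONSTANT, 3DPRIMITIVE passes.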
Comment 10 Kenneth Graunke 2018-07-28 07:28:30 UTC
I don't see how that can happen, we run the atom which emits 3DSTATE_CONSTANT_* on every draw call, based on push_constants_dirty bits which are set to true every time we emit 3DSTATE_PUSH_CONSTANT_ALLOC_VS.  The code even cites that exact rule.
Comment 11 Lionel Landwerlin 2018-07-28 11:47:41 UTC
(In reply to Kenneth Graunke from comment #10)
> I don't see how that can happen, we run the atom which emits
> 3DSTATE_CONSTANT_* on every draw call, based on push_constants_dirty bits
> which are set to true every time we emit 3DSTATE_PUSH_CONSTANT_ALLOC_VS. 
> The code even cites that exact rule.

Maybe this is some kind of interaction with Blorp.
Maybe adding BRW_NEW_BLORP to brw_tracked_state gen7_push_constant_space ?

In Anv we reemit the push constant allocation after a blorp set of commands.
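The suggestion above relies on the dirty-flag mechanism: each state atom lists the flags that should trigger its re-emission, so adding a blorp flag to an atom makes it re-emit after a blorp operation. The C sketch below is a simplified model of that idea, not the i965 code; the flag names, struct layout, and emit_dirty_atoms are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical dirty flags; the real driver defines many more. */
#define FLAG_NEW_CONTEXT (1u << 0)
#define FLAG_NEW_BLORP   (1u << 1)

/* Simplified model of a state atom with a dirty mask. */
struct tracked_state {
    const char *name;
    uint32_t dirty_mask;  /* flags that cause this atom to be re-emitted */
    unsigned emit_count;  /* stands in for actually writing commands */
};

/*
 * Emits every atom whose mask intersects the pending dirty flags,
 * then clears the flags. Returns the number of atoms emitted.
 */
static size_t emit_dirty_atoms(struct tracked_state *atoms, size_t count,
                               uint32_t *dirty)
{
    size_t emitted = 0;

    for (size_t i = 0; i < count; i++) {
        if (atoms[i].dirty_mask & *dirty) {
            atoms[i].emit_count++;
            emitted++;
        }
    }
    *dirty = 0;
    return emitted;
}
```

In this model, an atom with only FLAG_NEW_CONTEXT in its mask is skipped when just FLAG_NEW_BLORP is pending; once FLAG_NEW_BLORP is added to its mask, it is re-emitted after every blorp operation.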
Comment 12 Lionel Landwerlin 2018-07-28 12:04:50 UTC
Created attachment 140865 [details] [review]
i965: reemit push constant urb allocation after blorp

Let me know if that makes sense/works
Comment 13 Ren Chenglei 2018-07-30 05:53:50 UTC
(In reply to Lionel Landwerlin from comment #12)
> Created attachment 140865 [details] [review]
> i965: reemit push constant urb allocation after blorp
> 
> Let me know if that makes sense/works

I tried the patch, but it still does not work. The app crashes again.
Comment 14 Lionel Landwerlin 2018-07-30 11:35:24 UTC
(In reply to Ren Chenglei from comment #13)
> (In reply to Lionel Landwerlin from comment #12)
> > Created attachment 140865 [details] [review]
> > i965: reemit push constant urb allocation after blorp
> > 
> > Let me know if that makes sense/works
> 
> I tried the patch, but it still does not work. The app crashes again.

Thanks, could you upload the error state from a run with that patch?
Comment 15 Ren Chenglei 2018-07-30 14:48:27 UTC
Created attachment 140896 [details]
Dmesg log with patch 140865
Comment 16 Lionel Landwerlin 2018-07-30 14:50:12 UTC
(In reply to Ren Chenglei from comment #15)
> Created attachment 140896 [details]
> Dmesg log with patch 140865

Sorry, that's dmesg, not the error state (/sys/class/drm/card0/error).
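Since the error state is exposed as a file, capturing it right after a hang is a plain file copy. Here is a minimal C sketch, assuming the sysfs path named above and a hypothetical helper copy_file. It reads until end-of-file rather than trusting a reported file size, and reading the sysfs file typically requires root.

```c
#include <stdio.h>

/*
 * Copies src to dst by reading until end-of-file.
 * Returns the number of bytes copied, or -1 on any error.
 */
static long copy_file(const char *src, const char *dst)
{
    FILE *in = fopen(src, "rb");
    if (!in)
        return -1;

    FILE *out = fopen(dst, "wb");
    if (!out) {
        fclose(in);
        return -1;
    }

    char buf[4096];
    long total = 0;
    size_t n;
    int ok = 1;

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            ok = 0;
            break;
        }
        total += (long)n;
    }
    if (ferror(in))
        ok = 0;

    fclose(in);
    if (fclose(out) != 0)
        ok = 0;

    return ok ? total : -1;
}
```

A call such as copy_file("/sys/class/drm/card0/error", "error_state.txt") would save the state to a local file (the output name is arbitrary) that can then be attached to the bug.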
Comment 17 Ren Chenglei 2018-07-30 14:51:32 UTC
Created attachment 140897 [details]
Logcat with patch 140865
Comment 18 Ren Chenglei 2018-07-30 14:54:27 UTC
(In reply to Lionel Landwerlin from comment #16)
> (In reply to Ren Chenglei from comment #15)
> > Created attachment 140896 [details]
> > Dmesg log with patch 140865
> 
> Sorry, that's dmesg, not the error state (/sys/class/drm/card0/error).

Sorry about that. Could I upload it tomorrow? I have already left and the device is shut down.
Comment 19 Lionel Landwerlin 2018-07-30 14:58:39 UTC
Sure
Comment 20 Ren Chenglei 2018-07-31 01:52:53 UTC
Created attachment 140904 [details]
error state with patch 140865
Comment 21 Ren Chenglei 2018-08-10 11:16:31 UTC
Do we have any progress on this crash issue?
Comment 22 Tapani Pälli 2018-08-10 11:51:34 UTC
(In reply to Ren Chenglei from comment #21)
> Do we have any progress on this crash issue?

I haven't been able to figure out the hang. Some observations though .. I've disabled EXT_tessellation_shader in the driver, this causes antutu to jump over the first test and run rest of the tests and there is no hang. I've also tried disabling usage of push constants (hang seems to be related to these) but this did not help, hang still occurs. Next thing I'm going to try is to dump all shaders from first test and see if I could figure out which ones might be related to the hang.
Comment 23 Ren Chenglei 2018-08-10 14:00:19 UTC
(In reply to Tapani Pälli from comment #22)
> (In reply to Ren Chenglei from comment #21)
> > Do we have any progress on this crash issue?
> 
> I haven't been able to figure out the hang. Some observations though .. I've
> disabled EXT_tessellation_shader in the driver, this causes antutu to jump
> over the first test and run rest of the tests and there is no hang. I've
> also tried disabling usage of push constants (hang seems to be related to
> these) but this did not help, hang still occurs. Next thing I'm going to try
> is to dump all shaders from first test and see if I could figure out which
> ones might be related to the hang.

Thanks Tapani for the updates!
I tried following:

diff --git a/src/mesa/drivers/dri/i965/intel_batchbuffer.c b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
index df999ffeb1d0..325f28b35e60 100644
--- a/src/mesa/drivers/dri/i965/intel_batchbuffer.c
+++ b/src/mesa/drivers/dri/i965/intel_batchbuffer.c
@@ -823,7 +823,7 @@ submit_batch(struct brw_context *brw, int in_fence_fd, int *out_fence_fd)
    if (ret != 0) {
       fprintf(stderr, "i965: Failed to submit batchbuffer: %s\n",
               strerror(-ret));
-      exit(1);
+      //exit(1);
    }

    return ret;

If we don't exit after the GPU hang, the benchmark continues with the rest of the tests, but the first test still shows a black screen.
Comment 24 Ren Chenglei 2018-08-13 09:16:04 UTC
(In reply to Tapani Pälli from comment #22)
> (In reply to Ren Chenglei from comment #21)
> > Do we have any progress on this crash issue?
> 
> I haven't been able to figure out the hang. Some observations though .. I've
> disabled EXT_tessellation_shader in the driver, this causes antutu to jump
> over the first test and run rest of the tests and there is no hang. I've
> also tried disabling usage of push constants (hang seems to be related to
> these) but this did not help, hang still occurs. Next thing I'm going to try
> is to dump all shaders from first test and see if I could figure out which
> ones might be related to the hang.

I tried on a Pixel 2 and the first test works fine, although shaders still fail to load there as well. Please refer to the attachment pixel2_adblog.txt.
Comment 25 Ren Chenglei 2018-08-13 09:17:02 UTC
Created attachment 141058 [details]
Logcat with Pixel2
Comment 26 Ren Chenglei 2018-08-16 11:07:11 UTC
I may have found the root cause of the GPU hang:

diff --git a/src/mesa/vbo/vbo_exec_array.c b/src/mesa/vbo/vbo_exec_array.c
index 58bba208db10..fa871473ca14 100644
--- a/src/mesa/vbo/vbo_exec_array.c
+++ b/src/mesa/vbo/vbo_exec_array.c
@@ -1001,19 +1003,24 @@ vbo_exec_DrawElements(GLenum mode, GLsizei count, GLenum type,
    FLUSH_FOR_DRAW(ctx);

    if (_mesa_is_no_error_enabled(ctx)) {
-      _mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));
+      //The following function will cause screen black
+      //_mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));

       if (ctx->NewState)
          _mesa_update_state(ctx);
    } else {
-      _mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));
+      //The following function will cause screen black
+      //_mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));

       if (!_mesa_validate_DrawElements(ctx, mode, count, type, indices))
          return;
    }

-   vbo_validated_drawrangeelements(ctx, mode, GL_FALSE, 0, ~0,
+   if (mode != GL_PATCHES)
+     vbo_validated_drawrangeelements(ctx, mode, GL_FALSE, 0, ~0,
                                    count, type, indices, 0, 1, 0);
+   else
+     ALOGD("mesa - log - skip draw as GL_PATCHES cause GPU hang");
 }

Tapani, does Mesa support GL_PATCHES, which is only available with GLES 3.2 or greater?
https://www.khronos.org/registry/OpenGL-Refpages/es3/

When the mode is GL_PATCHES, we skip the draw, and the GPU hang can't be reproduced. 

BTW, the screen is black in the first test. I commented out this call:
_mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));
That improves things, but there are still some issues.
Comment 27 Tapani Pälli 2018-08-16 11:58:57 UTC
(In reply to Ren Chenglei from comment #26)
> Tapani, does Mesa support GL_PATCHES, which is only available with GLES 3.2
> or greater?
> https://www.khronos.org/registry/OpenGL-Refpages/es3/

Yes, we do support GL_PATCHES which is used with tessellation. This is good hint though, we could perhaps trace the actual frame that causes the hang .. unfortunately currently we don't have a way to take GL traces on Android :/ But we could dump the shaders at point when such draw call arrives.
 
> When the mode is GL_PATCHES, we skip the draw, and the GPU hang can't be
> reproduced. 
> 
> BTW, the screen is black in the first test. I commented out this call:
> _mesa_set_draw_vao(ctx, ctx->Array.VAO, enabled_filter(ctx));
> That improves things, but there are still some issues.

Yep we can't really skip these calls, they are necessary so that we have correct VAO in place.
Comment 28 yugang 2019-06-13 07:14:41 UTC
Hi Tapani,

Are you still working on this? We can still reproduce this issue with Mesa 19.0.6 on Android with kernel 4.19. I have attached the latest hang error state.
Comment 29 yugang 2019-06-13 07:16:07 UTC
Created attachment 144526 [details]
GPU HANG error state on Mesa 19.0.6 with kernel 4.19
Comment 30 Tapani Pälli 2019-06-13 10:52:52 UTC
(In reply to yugang from comment #28)
> Hi Tapani,
> 
> Are you still working on this? We can still reproduce this issue with Mesa
> 19.0.6 on Android with kernel 4.19. I have attached the latest hang error state.

I'm having trouble running this now; I'm getting "Download failed because the resources could not be found" :/ I will need to figure out how to make it work.
Comment 31 Tapani Pälli 2019-06-13 11:18:35 UTC
(In reply to Tapani Pälli from comment #30)
> (In reply to yugang from comment #28)
> > Hi Tapani,
> > 
> > Are you still working on this? We can still reproduce this issue with Mesa
> > 19.0.6 on Android with kernel 4.19. I have attached the latest hang error state.
> 
> I'm having trouble running this now; I'm getting "Download failed because
> the resources could not be found" :/ I will need to figure out how to make it
> work.

OK I got it running now and yes, it is still reproducible (happens also with Iris driver).
Comment 32 yugang 2019-08-02 10:20:31 UTC
(In reply to Tapani Pälli from comment #22)
> (In reply to Ren Chenglei from comment #21)
> > Do we have any progress on this crash issue?
> 
> I haven't been able to figure out the hang. Some observations though .. I've
> disabled EXT_tessellation_shader in the driver, this causes antutu to jump
> over the first test and run rest of the tests and there is no hang. I've
> also tried disabling usage of push constants (hang seems to be related to
> these) but this did not help, hang still occurs. Next thing I'm going to try
> is to dump all shaders from first test and see if I could figure out which
> ones might be related to the hang.

I tried this and found that the hang is related to the shader files below (when I commented out the loading of these shader files, the hang could not be reproduced):

/sdcard/androiddata/shaders\floop_tesselation.glsl_shadow_vs_vp
/sdcard/androiddata/shaders\floop_tesselation.glsl_shadow_tes_tes
/sdcard/androiddata/shaders\floop_tesselation.glsl_shadow_tcs_tcs

I am trying to dump those files to see whether they reveal anything related.
Comment 33 GitLab Migration User 2019-09-25 19:12:49 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1741.

