Bug 106433 - Android: surfaceflinger dies on GPU HANG with latest drm
Summary: Android: surfaceflinger dies on GPU HANG with latest drm
Status: RESOLVED NOTOURBUG
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: x86-64 (AMD64) other
Importance: medium critical
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-07 20:21 UTC by Mauro Rossi
Modified: 2018-07-24 07:09 UTC
CC List: 2 users

See Also:
i915 platform:
i915 features:


Attachments
logcat of few bootanimation cycles (2.45 MB, text/plain)
2018-05-07 20:21 UTC, Mauro Rossi
dmesg of few bootanimation cycles (111.81 KB, text/plain)
2018-05-07 20:22 UTC, Mauro Rossi
Dump of /sys/class/drm/card0/error (886.21 KB, text/plain)
2018-05-07 20:24 UTC, Mauro Rossi
dmesg of several Android GUI restarts (126.14 KB, text/plain)
2018-06-07 17:58 UTC, Mauro Rossi
logcat with drm.debug=30 zipped (1.19 MB, application/zip)
2018-06-07 18:01 UTC, Mauro Rossi

Description Mauro Rossi 2018-05-07 20:21:19 UTC
Created attachment 139411 [details]
logcat of few bootanimation cycles

Hi,
I'm testing Android 8.1 (oreo-x86 branch of android-x86)
with the following gfx stack:

- drm_hwcomposer from freedesktop.org (hwctwo enabled, also tested with robherring's handle-rework branch)
- latest gbm_gralloc (robherring's handle-rework branch, but it also happens with branches prior to handle-rework)
- latest libdrm, but it happens with all releases from 2.4.89 (and before) to 2.4.91
- latest kernel 4.17-rc4, but it happens with all kernels

Hardware: Laptop Lenovo T460 with Skylake GT2

Symptoms: The bootanimation completes, but then the Android GUI hangs, the surfaceflinger service is killed, and the bootanimation restarts (GUI bootloop).

It was very difficult to get debug logs showing what causes signal 1 (SIGHUP) to kill surfaceflinger, but I could finally trace it:

05-07 21:36:58.689     0     0 I         : [drm] GPU HANG: ecode 9:0:0x00280000, in surfaceflinger [2497], reason: No progress on rcs0, action: reset
05-07 21:36:58.689     0     0 I         : [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
05-07 21:36:58.689     0     0 I         : [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
05-07 21:36:58.689     0     0 I         : [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
05-07 21:36:58.689     0     0 I         : [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
05-07 21:36:58.689     0     0 I         : [drm] GPU crash dump saved to /sys/class/drm/card0/error
05-07 21:36:58.689     0     0 I i915 0000: 00:02.0: Resetting rcs0 after gpu hang


logcat, dmesg and GPU crash dump /sys/class/drm/card0/error are provided.
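
When the hang has to be found again in a multi-megabyte logcat or dmesg capture, a small filter can help. The following is a minimal sketch in Python, assuming the `GPU HANG` line keeps the exact layout shown in the excerpt above; the names `HANG_RE` and `parse_gpu_hang` are hypothetical helpers, not part of any tool mentioned in this report.

```python
import re

# Pattern modelled on the single "GPU HANG" line quoted above; other kernel
# versions may format this message differently (assumption).
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fA-Fx:]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.*?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return a dict describing a GPU hang, or None if the line is unrelated."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["pid"] = int(info["pid"])
    return info


# The sample is the log line from this report, joined into one string.
sample = (
    "05-07 21:36:58.689     0     0 I         : [drm] GPU HANG: "
    "ecode 9:0:0x00280000, in surfaceflinger [2497], "
    "reason: No progress on rcs0, action: reset"
)
hang = parse_gpu_hang(sample)
print(hang)
```

Running `parse_gpu_hang` over every line of a saved log and keeping the non-`None` results would list each hang together with the process that triggered it.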

Please help to identify the root cause, as this is probably the only showstopper preventing drm_hwcomposer (hwctwo) + gbm_gralloc + libdrm from booting Android oreo-x86 with a full freedesktop stack.

I am available to support further investigation and to test patches.

Mauro Rossi
android-x86 team
Comment 1 Mauro Rossi 2018-05-07 20:22:39 UTC
Created attachment 139412 [details]
dmesg of few bootanimation cycles
Comment 2 Mauro Rossi 2018-05-07 20:24:01 UTC
Created attachment 139413 [details]
Dump of /sys/class/drm/card0/error
Comment 3 Chris Wilson 2018-05-07 20:27:19 UTC
The batch has been overwritten by pixel data. Instinct would be that the (foreign?) buffer allocation didn't match expectations.
Comment 4 Tapani Pälli 2018-05-08 06:19:02 UTC
Have you tried bisecting Mesa to see whether some particular commit causes this? This could also be some difference between cros_gralloc and gbm_gralloc. Unfortunately I can't test anything on Android ATM, but I would advise trying to bisect first.
Comment 5 Mauro Rossi 2018-06-07 17:58:41 UTC
Created attachment 140072 [details]
dmesg of several Android GUI restarts

Hi,
as an update, I have conducted further tests on Mesa branches 18.0, 18.1 (also including the Android-IA patches) and mesa-dev
before starting a blind bisect, which with my setup would take ages.

The result is that all versions from 18.0 to mesa-dev show the same problem: Android boot does not complete, but the GUI restart does not always happen at the same time or in the same way.

In most cases the bootanimation is interrupted; in others the GUI freezes while drawing the status bar.

After letting the GUI restart repeatedly for a while, I collected dmesg and logcat
with drm.debug=30 in order to trace problems occurring across several attempts.

Could you please have a look? Since the GPU hang does not occur systematically, I'd like to understand what is causing the SurfaceFlinger process to hang.
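
For anyone reading the attached logs, it helps to know which message categories `drm.debug=30` switches on. The following is a minimal sketch in Python, assuming the value is parsed as decimal 30 and that the bit assignments match the kernel's `DRM_UT_*` debug categories from around Linux 4.17; both points are assumptions and should be checked against the kernel actually in use. `DRM_DEBUG_BITS` and `decode_drm_debug` are hypothetical names.

```python
# Bit-to-category mapping assumed from the kernel's DRM_UT_* definitions
# around Linux 4.17; newer kernels define additional bits.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
}


def decode_drm_debug(value):
    """Return the names of the debug categories enabled by a drm.debug mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if value & bit]


categories = decode_drm_debug(30)
print(categories)
```

Under these assumptions the mask selects driver, KMS, PRIME and atomic messages, which covers buffer sharing and modesetting but leaves out the core and vblank categories.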

Mauro
Comment 6 Mauro Rossi 2018-06-07 18:01:04 UTC
Created attachment 140073 [details]
logcat with drm.debug=30 zipped
Comment 7 Tapani Pälli 2018-06-11 05:16:08 UTC
Since Android-IA works OK, one option would be to simply go through the differences in kernel, Mesa, and minigbm vs gbm_gralloc. Have you tried using the Android-IA kernel tree?
Comment 8 Mauro Rossi 2018-07-23 19:55:17 UTC
Hi, this one can be closed: using the latest mesa-dev with default support for dma-bufs (prime fd), oreo-x86 could boot.

I think the problem came from combining gbm_gralloc, which supports prime fd,
with Mesa at a commit that supported flink.

Mauro
Comment 9 Tapani Pälli 2018-07-24 07:09:57 UTC
resolving, see comment #8

