104891 – [CFL] GPU hang during VA-API H.264 encoding

Status:	CLOSED NOTOURBUG

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-01-31 22:45 UTC by Steinar H. Gunderson
Modified:	2018-03-02 15:12 UTC
CC List:	1 user

See Also:
i915 platform:	CFL
i915 features:	GPU hang

Attachments
/sys/class/drm/card1/error from after a hang (72.49 KB, text/plain) 2018-01-31 22:45 UTC, Steinar H. Gunderson
dmesg with drm.debug=0x1e (103.26 KB, text/plain) 2018-02-01 18:24 UTC, Steinar H. Gunderson
/sys/class/drm/card1/error after hang (with drm.debug=0x1e) (67.76 KB, text/plain) 2018-02-01 18:25 UTC, Steinar H. Gunderson

Description Steinar H. Gunderson 2018-01-31 22:45:11 UTC

Created attachment 137096
/sys/class/drm/card1/error from after a hang

Hi,

I'm trying a brand new Coffee Lake (i5-8400). The primary GPU is an NVIDIA card; the Intel GPU is used for VA-API only. I can post frames for encoding just fine (I have my own VA-API application), but they seemingly never come back, and the kernel complains:

[  621.837356] [drm] GPU HANG: ecode 9:0:0x8fd8ffff, reason: Hang on rcs0, action: reset
[  621.837356] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  621.837357] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  621.837357] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  621.837357] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  621.837357] [drm] GPU crash dump saved to /sys/class/drm/card1/error
[  621.837361] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  629.861470] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  637.861515] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  645.861528] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  653.861539] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  655.845499] i915 0000:00:02.0: Resetting vcs0 after gpu hang

This is 4.15.0-rc8. Hang dump attached.
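
The reset messages above can be tallied per engine to see which one keeps hanging. What follows is a minimal Python sketch, not part of the original report; the helper name `count_engine_resets` and the regular expression are illustrative assumptions derived only from the log lines quoted here.

```python
import re
from collections import Counter

# Matches lines such as:
#   [  621.837361] i915 0000:00:02.0: Resetting rcs0 after gpu hang
# The pattern is an assumption based on the messages quoted in this report.
RESET_RE = re.compile(r"i915 \S+ Resetting (\w+) after gpu hang")


def count_engine_resets(log_text):
    """Return a Counter mapping engine name to number of reset messages."""
    counts = Counter()
    for line in log_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The six reset lines quoted in the description.
SAMPLE_LOG = """\
[  621.837361] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  629.861470] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  637.861515] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  645.861528] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  653.861539] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[  655.845499] i915 0000:00:02.0: Resetting vcs0 after gpu hang
"""

resets = count_engine_resets(SAMPLE_LOG)
```

On the excerpt above this gives five resets of rcs0 and one of vcs0, matching what can be read off the log by eye.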

Comment 1 Elizabeth 2018-02-01 17:14:46 UTC

Hello Steinar, could you try booting with i915.alpha_support=1 on the kernel command line (set in GRUB), or enabling CONFIG_DRM_I915_ALPHA_SUPPORT in the kernel config? Thank you.

Comment 2 Elizabeth 2018-02-01 17:17:48 UTC

Also, could you take a look at bug 104377? Thank you.

Comment 3 Steinar H. Gunderson 2018-02-01 17:21:59 UTC

OK, so I booted with alpha_support=1. This gives me

[    6.119181] i915 0000:00:02.0: firmware: failed to load i915/kbl_dmc_ver1_01.bin (-2)
[    6.119185] i915 0000:00:02.0: Direct firmware load for i915/kbl_dmc_ver1_01.bin failed with error -2
[    6.119186] i915 0000:00:02.0: Failed to load DMC firmware i915/kbl_dmc_ver1_01.bin. Disabling runtime power management.
[    6.119187] i915 0000:00:02.0: DMC firmware homepage: https://01.org/linuxgraphics/downloads/firmware

but I suppose it's noncritical.

VA-API still does not work; it just hangs the GPU like before.

I looked at 104377, but I'm not entirely sure what to infer from it. The installation is almost entirely the same as on my Haswell, where VA-API _did_ work (I replaced motherboard/CPU/RAM, everything else is exactly the same); I had to upgrade from 4.14 to 4.15 to get CFL support at all, but everything else is at the same version. In particular, I haven't upgraded or downgraded Mesa, and Mesa isn't really in effect here anyway, since OpenGL is handled by the NVIDIA card.

Comment 4 Steinar H. Gunderson 2018-02-01 17:28:58 UTC

I installed the firmware it asked for and rebooted; no change.

Comment 5 Elizabeth 2018-02-01 18:07:58 UTC

Could you get a dmesg or a clean kern.log with debug information, covering boot until the hang? Add drm.debug=0x1e log_buf_len=4M (or bigger as needed) to the kernel command line in GRUB.
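
Before reproducing the hang, it can help to confirm that the requested parameters actually reached the kernel. Below is a minimal Python sketch, assuming the running kernel's command line is readable from /proc/cmdline (standard on Linux). The helper name `missing_boot_params` is hypothetical, and `log_buf_len` is the kernel's spelling of the log buffer size parameter.

```python
def missing_boot_params(cmdline, required):
    """Return the entries of `required` that are absent from `cmdline`.

    `cmdline` is the kernel command line as one string, for example the
    contents of /proc/cmdline. Parameters are compared as whole
    whitespace-separated tokens, so "drm.debug=0x1e" must match exactly.
    """
    present = set(cmdline.split())
    return [param for param in required if param not in present]


# Parameters requested in this comment.
REQUIRED = ["drm.debug=0x1e", "log_buf_len=4M"]

# Typical use on the affected machine (not executed here):
#   with open("/proc/cmdline") as f:
#       print(missing_boot_params(f.read(), REQUIRED))
```

An empty list means both parameters are active. Note that a larger buffer such as log_buf_len=8M would be reported as missing by this exact-match check, so adjust `REQUIRED` to whatever value was actually set.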

Comment 6 Steinar H. Gunderson 2018-02-01 18:24:48 UTC

Created attachment 137118
dmesg with drm.debug=0x1e

Comment 7 Steinar H. Gunderson 2018-02-01 18:25:13 UTC

Created attachment 137119
/sys/class/drm/card1/error after hang (with drm.debug=0x1e)

Comment 8 Steinar H. Gunderson 2018-02-01 18:25:43 UTC

Attached new hang log and dmesg.

Comment 9 Elizabeth 2018-02-01 18:49:20 UTC

From dmesg:

[   36.288619] [drm:intel_gpu_reset [i915]] rcs0: timed out on STOP_RING
[   36.288639] [drm:i915_gem_reset_engine [i915]] context nageru[1504]/0 marked guilty (score 10) banned? no
[   36.288653] [drm:i915_gem_reset_engine [i915]] resetting rcs0 to restart from tail of request 0x2
[   36.288689] [drm:gen8_init_common_ring [i915]] Execlists enabled for rcs0
[   36.288959] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 14
[   40.764289] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x59/0x80 [i915], irq posted? yes, current seqno=2, last=b
[   43.804482] i915 0000:00:02.0: Resetting rcs0 after gpu hang
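
The "marked guilty" line is the one that names the client whose context was blamed for the hang. Below is a minimal Python sketch for pulling those fields out of a debug dmesg. The regular expression is an assumption modelled on the single line quoted above, and reading the bracketed number as a process ID and the value after the slash as a context handle is likewise an assumption; `find_guilty_context` is a hypothetical helper name.

```python
import re

# Modelled on:
#   ... context nageru[1504]/0 marked guilty (score 10) banned? no
GUILTY_RE = re.compile(
    r"context (?P<name>[^\[\s]+)\[(?P<pid>\d+)\]/(?P<handle>\w+) "
    r"marked guilty \(score (?P<score>\d+)\) banned\? (?P<banned>\w+)"
)


def find_guilty_context(line):
    """Return a dict describing the blamed context, or None if no match."""
    match = GUILTY_RE.search(line)
    if match is None:
        return None
    return {
        "name": match.group("name"),
        "pid": int(match.group("pid")),        # assumed to be a PID
        "handle": match.group("handle"),       # kept as text; meaning assumed
        "score": int(match.group("score")),
        "banned": match.group("banned") == "yes",
    }


SAMPLE_LINE = (
    "[   36.288639] [drm:i915_gem_reset_engine [i915]] "
    "context nageru[1504]/0 marked guilty (score 10) banned? no"
)

guilty = find_guilty_context(SAMPLE_LINE)
```

For the quoted line this yields the name nageru, the number 1504, a score of 10, and banned set to False.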

Comment 10 Elizabeth 2018-02-01 20:22:27 UTC

Raising priority since this is CFL and a GPU hang.

Comment 11 Steinar H. Gunderson 2018-02-17 13:48:11 UTC

Still there in 4.16-rc1.

Comment 12 Steinar H. Gunderson 2018-02-17 13:51:56 UTC

Also verified it is reproducible with ffmpeg:

root@gruessi:~# ffmpeg -vaapi_device /dev/dri/renderD129 -i elephants_dream_1080p24.y4m -vf 'format=nv12,hwupload' -c:v h264_vaapi test.mp4 

which first hangs the GPU, then after about a minute gives up and starts outputting zero bytes per frame.

Comment 13 Steinar H. Gunderson 2018-02-17 15:37:07 UTC

Upgraded the Intel VA-API driver from 2.0.0 to 2.1.0 (and correspondingly, libva-dev to get the new ABI), which seems to fix the problem. Does this mean this is not-a-bug?

Comment 14 Elizabeth 2018-02-19 16:14:33 UTC

(In reply to Steinar H. Gunderson from comment #13)
> Upgraded the Intel VA-API driver from 2.0.0 to 2.1.0 (and correspondingly,
> libva-dev to get the new ABI), which seems to fix the problem. Does this
> mean this is not-a-bug?
Should be NOTOURBUG. The bug existed in the Intel VA-API driver 2.0.0 but is fixed in 2.1.0.
