Created attachment 137096
/sys/class/drm/card1/error from after a hang
I'm trying a brand new Coffee Lake (i5-8400). The primary GPU is an NVIDIA card; the Intel GPU is used for VA-API only. I can post frames for encoding just fine (I have my own VA-API application), but they seemingly never come back, and the kernel complains:
[ 621.837356] [drm] GPU HANG: ecode 9:0:0x8fd8ffff, reason: Hang on rcs0, action: reset
[ 621.837356] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 621.837357] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 621.837357] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 621.837357] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 621.837357] [drm] GPU crash dump saved to /sys/class/drm/card1/error
[ 621.837361] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 629.861470] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 637.861515] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 645.861528] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 653.861539] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 655.845499] i915 0000:00:02.0: Resetting vcs0 after gpu hang
This is 4.15.0-rc8. Hang dump attached.
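To summarize which engines are being reset, the "Resetting ... after gpu hang" lines above can be tallied per engine. This is only a rough sketch: the regular expression is written against the message format shown in this report, and the function name is made up for illustration.

```python
import re
from collections import Counter

# Written against lines of the form seen in this report, e.g.
# "[ 621.837361] i915 0000:00:02.0: Resetting rcs0 after gpu hang".
# Other kernel versions may word the message differently (assumption).
RESET_RE = re.compile(r"i915 \S+ Resetting (\w+) after gpu hang")


def count_engine_resets(dmesg_text):
    """Return a Counter mapping engine name to number of reset messages."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The reset lines quoted in this report.
SAMPLE_LOG = """\
[ 621.837361] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 629.861470] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 637.861515] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 645.861528] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 653.861539] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 655.845499] i915 0000:00:02.0: Resetting vcs0 after gpu hang
"""

resets = count_engine_resets(SAMPLE_LOG)
```

For the excerpt above this gives five resets on rcs0 and one on vcs0, i.e. the render engine hangs repeatedly before the video engine is reset.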
Hello Steinar, could you try alpha_support=1 in GRUB or enabling CONFIG_DRM_I915_ALPHA_SUPPORT in the kernel config? Thank you.
Also, could you take a look at bug 104377? Thank you.
OK, so I booted with alpha_support=1. This gives me
[ 6.119181] i915 0000:00:02.0: firmware: failed to load i915/kbl_dmc_ver1_01.bin (-2)
[ 6.119185] i915 0000:00:02.0: Direct firmware load for i915/kbl_dmc_ver1_01.bin failed with error -2
[ 6.119186] i915 0000:00:02.0: Failed to load DMC firmware i915/kbl_dmc_ver1_01.bin. Disabling runtime power management.
[ 6.119187] i915 0000:00:02.0: DMC firmware homepage: https://01.org/linuxgraphics/downloads/firmware
but I suppose it's noncritical.
VA-API still does not work; the GPU just hangs like before.
I looked at 104377, but I'm not entirely sure what to infer from it. The installation is almost entirely the same as on my Haswell, where VA-API _did_ work (I replaced motherboard/CPU/RAM, everything else is exactly the same); I had to upgrade from 4.14 to 4.15 to get CFL support at all, but everything else is in the same version. In particular, I haven't upgraded or downgraded Mesa, and Mesa isn't really in effect here anyway, since OpenGL is handled by the NVIDIA.
I installed the firmware it asked for and rebooted; no change.
Could you get a dmesg or a clean kern.log with debug information, covering boot until the hang? Please boot with drm.debug=0x1e log_buf_len=4M (or bigger as needed) added to the kernel command line in GRUB.
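Before reproducing the hang, it may be worth checking that the requested parameters actually reached the kernel (the log buffer size parameter is spelled log_buf_len). A small sketch follows; the helper names are invented here, and on a live system the input string would come from /proc/cmdline.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values containing spaces
    are not handled; this is a simplification.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def missing_debug_params(cmdline, wanted=("drm.debug", "log_buf_len")):
    """Return the wanted parameter names that are absent from cmdline."""
    params = parse_cmdline(cmdline)
    return [name for name in wanted if name not in params]


# Example command line; the root device here is purely illustrative.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro drm.debug=0x1e log_buf_len=4M"
missing = missing_debug_params(example)
```

On the affected machine, something like `missing_debug_params(open("/proc/cmdline").read())` should return an empty list if both parameters were picked up.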
Created attachment 137118
dmesg with drm.debug=0x1e
Created attachment 137119
/sys/class/drm/card1/error after hang (with drm.debug=0x1e)
Attached new hang log and dmesg.
[ 36.288619] [drm:intel_gpu_reset [i915]] rcs0: timed out on STOP_RING
[ 36.288639] [drm:i915_gem_reset_engine [i915]] context nageru/0 marked guilty (score 10) banned? no
[ 36.288653] [drm:i915_gem_reset_engine [i915]] resetting rcs0 to restart from tail of request 0x2
[ 36.288689] [drm:gen8_init_common_ring [i915]] Execlists enabled for rcs0
[ 36.288959] [drm:init_workarounds_ring [i915]] rcs0: Number of context specific w/a: 14
[ 40.764289] [drm:missed_breadcrumb [i915]] rcs0 missed breadcrumb at intel_breadcrumbs_hangcheck+0x59/0x80 [i915], irq posted? yes, current seqno=2, last=b
[ 43.804482] i915 0000:00:02.0: Resetting rcs0 after gpu hang
Raising priority since this is CFL and a GPU hang.
Still there in 4.16-rc1.
Also verified it is reproducible with ffmpeg:
root@gruessi:~# ffmpeg -vaapi_device /dev/dri/renderD129 -i elephants_dream_1080p24.y4m -vf 'format=nv12,hwupload' -c:v h264_vaapi test.mp4
which first hangs the GPU, then after about a minute gives up and starts outputting zero bytes per frame.
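Since the reproduction stalls for about a minute before ffmpeg carries on, it can help to run it under a time limit. A minimal sketch, using only the ffmpeg options from the command above; the function names and the default timeout are assumptions made here.

```python
import subprocess


def build_repro_command(device, source, output):
    """Build the ffmpeg argument list used in this report."""
    return [
        "ffmpeg",
        "-vaapi_device", device,
        "-i", source,
        "-vf", "format=nv12,hwupload",
        "-c:v", "h264_vaapi",
        output,
    ]


def run_repro(device, source, output, timeout_s=120):
    """Run the reproduction; return True only if ffmpeg exits 0 in time.

    A True result does not prove the encode was good: per the report,
    ffmpeg may keep running and emit zero-byte frames after the hang.
    """
    command = build_repro_command(device, source, output)
    try:
        result = subprocess.run(command, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


command = build_repro_command(
    "/dev/dri/renderD129", "elephants_dream_1080p24.y4m", "test.mp4"
)
```

The exit status alone is not a reliable indicator here, so the kernel log should still be checked for "GPU HANG" messages after the run.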
Upgraded the Intel VA-API driver from 2.0.0 to 2.1.0 (and correspondingly, libva-dev to get the new ABI), which seems to fix the problem. Does this mean this is not-a-bug?
(In reply to Steinar H. Gunderson from comment #13)
> Upgraded the Intel VA-API driver from 2.0.0 to 2.1.0 (and correspondingly,
> libva-dev to get the new ABI), which seems to fix the problem. Does this
> mean this is not-a-bug?
Should be NOTOURBUG. The bug existed in the Intel VA-API driver 2.0.0 but is fixed in the newer 2.1.0.
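For anyone hitting the same hang on Coffee Lake, a quick version comparison can tell whether the installed Intel VA-API driver predates the release that fixed it here. This is a sketch under assumptions: only 2.0.0 (bad) and 2.1.0 (good) were observed in this report, so treating every version below 2.1.0 as affected is a guess, and the function names are made up.

```python
def parse_version(text):
    """Turn a dotted version string such as '2.1.0' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


# First driver version reported as working on CFL in this bug.
# Versions older than 2.0.0 were not tested here (assumption).
FIRST_GOOD = (2, 1, 0)


def driver_likely_affected(version_text):
    """Return True if the given driver version is older than FIRST_GOOD."""
    return parse_version(version_text) < FIRST_GOOD


old_affected = driver_likely_affected("2.0.0")
new_affected = driver_likely_affected("2.1.0")
```

The version string has to be supplied by hand (for example from the distribution's package manager); the sketch does not query the driver itself.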