Bug 99942 - [APL] GPU Hang during transcoding video
Summary: [APL] GPU Hang during transcoding video
Status: CLOSED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Rodrigo Vivi
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-24 11:28 UTC by edwardtseng
Modified: 2017-03-01 22:16 UTC

See Also:
i915 platform: BXT
i915 features: GPU hang


Attachments
dump from /sys/class/drm/card0/error (892.00 KB, text/plain)
2017-02-24 11:28 UTC, edwardtseng

Description edwardtseng 2017-02-24 11:28:53 UTC
Created attachment 129895
dump from /sys/class/drm/card0/error

Hi,
While transcoding HEVC 10-bit video to H.264, the GPU may hang.
It recovers only after the device is rebooted.
The kernel messages are as follows:
[  973.709462] [drm] GPU HANG: ecode 9:4:0xacdfbffd, in ffmpeg [6992], reason: Hang on video enhancement ring, action: reset
[  973.720436] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  973.729630] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  973.738477] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  973.748085] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  973.757000] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  973.764295] drm/i915: Resetting chip after gpu hang
[  973.769821] [drm] GuC firmware load skipped
[  983.766780] [drm:i915_gem_wait_for_error.part.38] *ERROR* Timed out waiting for the gpu reset to complete
[  993.780258] [drm:i915_gem_wait_for_error.part.38] *ERROR* Timed out waiting for the gpu reset to complete
[ 1003.792770] [drm:i915_gem_wait_for_error.part.38] *ERROR* Timed out waiting for the gpu reset to complete
[ 1013.805246] [drm:i915_gem_wait_for_error.part.38] *ERROR* Timed out waiting for the gpu reset to complete
[ 1023.817750] [drm:i915_gem_wait_for_error.part.38] *ERROR* Timed out waiting for the gpu reset to complete
[ 1024.719067] [drm:i915_gem_wait_for_error.part.38] *ERROR* Timed out waiting for the gpu reset to complete
[ 1033.830242] [drm:i915_gem_wait_for_error.part.38] *ERROR* Timed out waiting for the gpu reset to complete
[ 1034.731572] [drm:i915_gem_wait_for_error.part.38] *ERROR* Timed out waiting for the gpu reset to complete
[ 1043.843737] [drm:i915_gem_wait_for_error.part.38] *ERROR* Timed out waiting for the gpu reset to complete
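
For anyone triaging similar logs, the summary line and the repeated reset timeouts above can be picked out mechanically. The following is a minimal sketch, not an official tool: the regular expression is written against the exact wording of the log in this report, the helper names are hypothetical, and reading the two leading ecode numbers as GPU generation and engine index is an assumption that this report does not confirm.

```python
import re

# Pattern written against the "GPU HANG" line as it appears in this report.
# The group names "gen" and "engine" are assumed meanings, not confirmed here.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.*?), action: (?P<action>\w+)"
)

TIMEOUT_TEXT = "Timed out waiting for the gpu reset to complete"


def parse_gpu_hang(line):
    """Return the fields of a GPU HANG log line as a dict, or None if absent."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    for key in ("gen", "engine", "pid"):
        info[key] = int(info[key])
    return info


def count_reset_timeouts(log_text):
    """Count log lines reporting that the GPU reset did not complete."""
    return sum(TIMEOUT_TEXT in line for line in log_text.splitlines())
```

A process name containing spaces would not match this pattern; it is only meant for logs shaped like the one above.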

dump from /sys/class/drm/card0/error:
GPU HANG: ecode 9:4:0xacdfbffd, in ffmpeg [6992], reason: Hang on video enhancement ring, action: reset
Time: 1487838143 s 4506 us
Kernel: 4.2.8
Active process (on ring vebox): ffmpeg [6992]
Reset count: 0
Suspend count: 0
PCI ID: 0x5a85
PCI Revision: 0x0b
PCI Subsystem: 8086:2112
IOMMU enabled?: 0
DMC loaded: yes
DMC fw version: 1.7
EIR: 0x00000000
IER: 0x08000000
GTIER gt 0: 0x01010101
GTIER gt 1: 0x01010101
GTIER gt 2: 0x00000070
GTIER gt 3: 0x00000101
PGTBL_ER: 0x00000000
FORCEWAKE: 0xffff0001
DERRMR: 0x2077efef
CCID: 0x00000000
Missed interrupts: 0x00000000
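
The header of the error dump above is a list of "key: value" lines, so it can be loaded into a dictionary for quick comparison between dumps. This is only a sketch under that assumption; the function name is hypothetical, and it handles only the header shown here, not the register and ring contents that follow in the full attachment.

```python
def parse_error_state_header(text):
    """Parse "key: value" lines into a dict, keeping the first value per key."""
    fields = {}
    for line in text.splitlines():
        # Split on the first ": " only, so values may themselves contain colons.
        key, sep, value = line.partition(": ")
        key = key.strip()
        if sep and key and key not in fields:
            fields[key] = value.strip()
    return fields
```

With the header above, parse_error_state_header(header)["Reset count"] would give the string "0", and the "GPU HANG" key would hold the rest of the summary line.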

CPU: Intel(R) Celeron(R) CPU J3455 @ 1.50GHz
Git src tag: drm-intel-testing-2016-07-25
Please advise.
Thank you!
Cheers,
Edward Tseng
Comment 1 Ricardo 2017-02-24 21:08:14 UTC
There have been changes to the GuC firmware in kernel 4.10. Could you update your kernel and let us know whether the issue persists?

Also, 01.org has updated GuC firmware available for download: https://01.org/linuxgraphics/downloads/broxton-guc-8.7

I will move the bug to the NeedInfo state. Once you have added the details, please change the bug status back to Reopen.
Comment 2 Rodrigo Vivi 2017-02-24 21:35:27 UTC
I'm not sure this is GuC-related:
"[drm] GuC firmware load skipped"

Could you please boot with drm.debug=0xe, reproduce the issue and post the dmesg output here?
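
For reference, drm.debug is a bitmask of debug categories. The sketch below shows which categories 0xe selects, assuming the bit values of the DRM_UT_* definitions in the kernel's DRM headers (CORE 0x01, DRIVER 0x02, KMS 0x04, PRIME 0x08); the table and function names are illustrative only, and newer kernels define additional bits that are not listed here.

```python
# Debug category bits, assumed from the kernel's DRM_UT_* definitions.
# Only the four lowest bits are listed; later kernels add more categories.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by a drm.debug mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0xE))  # prints ['DRIVER', 'KMS', 'PRIME']
```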

Also could you please attach
/sys/kernel/debug/dri/0/i915_guc_load_status 

Thanks,
Rodrigo.
Comment 3 edwardtseng 2017-03-01 11:51:40 UTC
Hi Rodrigo:
I turned on debugging, and the kernel messages are as follows:
<7>[  486.419915] [drm:drm_ioctl] pid=30654, dev=0xe280, auth=1, I915_GEM_SW_FINISH
<7>[  486.419917] [drm:drm_ioctl] pid=30654, dev=0xe280, auth=1, I915_GEM_SW_FINISH
<7>[  486.419919] [drm:drm_ioctl] pid=30654, dev=0xe280, auth=1, I915_GEM_EXECBUFFER2
<6>[  494.557219] [drm] GPU HANG: ecode 9:4:0xacdfbffd, in ffmpeg [30654], reason: Hang on video enhancement ring, action: reset
<6>[  494.568355] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
<6>[  494.577563] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
<6>[  494.586404] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
<6>[  494.596034] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
<6>[  494.604973] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  494.611401] [drm:i915_reset_and_wakeup] resetting chip
<5>[  494.611406] drm/i915: Resetting chip after gpu hang
<7>[  494.611425] [drm:drm_ioctl] pid=30654, dev=0xe280, auth=1, I915_GEM_MADVISE
<7>[  494.616426] [drm:gen8_init_common_ring] Execlists enabled for render ring
<7>[  494.616442] [drm:gen8_init_common_ring] Execlists enabled for blitter ring
<7>[  494.616454] [drm:gen8_init_common_ring] Execlists enabled for bsd ring
<7>[  494.616465] [drm:gen8_init_common_ring] Execlists enabled for video enhancement ring
<7>[  494.616482] [drm:intel_guc_setup] GuC fw status: path i915/kbl_guc_ver9_14.bin, fetch NONE, load NONE
<6>[  494.616484] [drm] GuC firmware load skipped
<7>[  494.620724] [drm:drm_ioctl] pid=30654, dev=0xe280, auth=1, DRM_IOCTL_GEM_CLOSE
<7>[  494.620728] [drm:drm_ioctl] pid=30654, dev=0xe280, auth=1, DRM_IOCTL_GEM_CLOSE
<7>[  494.620732] [drm:drm_ioctl] pid=30654, dev=0xe280, auth=1, DRM_IOCTL_GEM_CLOSE

PS: I cannot find the /sys/kernel/debug/dri/0/i915_guc_load_status node.
Is there a debug option I need to set up?
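
One thing that can be ruled out quickly is whether debugfs is mounted at all, since nothing under /sys/kernel/debug is visible without it. This is only one possible cause and is not confirmed anywhere in this thread; the node may also simply not exist in this kernel build. A minimal sketch, with a hypothetical helper name, that scans text in the format of /proc/mounts:

```python
def debugfs_mount_points(mounts_text):
    """Return the mount points of all debugfs entries in /proc/mounts text."""
    points = []
    for line in mounts_text.splitlines():
        # Each line is: device, mount point, filesystem type, options, ...
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "debugfs":
            points.append(fields[1])
    return points
```

On a running system the input would be the contents of /proc/mounts; an empty result means debugfs is not mounted.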

Thank you!
Cheers,
Edward Tseng
Comment 4 Ricardo 2017-03-01 20:31:24 UTC
Rodrigo, can you help with the question? Once you reply, you can reset the assignee to the mailing list.
Comment 5 Rodrigo Vivi 2017-03-01 22:12:53 UTC
This is not a firmware-related bug, since GuC is not getting loaded, so I'm changing the category.
Comment 6 Rodrigo Vivi 2017-03-01 22:16:04 UTC
Looking at the error state, it appears to hang on the very first attempt to use the VECS ring. The very first entry on the VECS ring doesn't look like a valid command.
This looks like a user-space bug to me.
I assume you are using the open-source libva with vaapi-intel-driver. If that is the case, please report this issue at https://github.com/01org/intel-vaapi-driver/issues.

I'm closing this bug here for now. Feel free to reopen if necessary.

