After about 4-6 weeks of constant H.264 decoding and copying back to the processor via ffmpeg+libva+beignet, we are seeing a GPU hang. It has occurred 7-8 times on 5 of 20 different N4200 machines. This is a headless system with nothing else using the GPU (no display connected).

Environment:
ApolloLake/Broxton N4200
Linux Kernel 4.10 (yocto project pyro 2.3.3 with matched meta-intel)
libva 1.7.3

# vainfo
libva info: VA-API version 0.39.4
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/dri/i965_drv_video.so
libva info: Found init function __vaDriverInit_0_39
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.39 (libva 1.7.3)
vainfo: Driver version: Intel i965 driver for Intel(R) Broxton - 1.7.3
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264MultiviewHigh      : VAEntrypointVLD
      VAProfileH264MultiviewHigh      : VAEntrypointEncSlice
      VAProfileH264StereoHigh         : VAEntrypointVLD
      VAProfileH264StereoHigh         : VAEntrypointEncSlice
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileVP8Version0_3          : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileVP9Profile0            : VAEntrypointVLD

# uname -a
Linux N4200-test-machine 4.10.17-yocto-standard #2 SMP PREEMPT Tue Apr 24 00:12:09 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Tried cmdline i915.enable_rc6=0 and it did not help.
Will try switching to yocto rocko (kernel 4.12, new libva), but it will take a very long time to confirm whether the upgrade fixes it, since this is a pretty rare hang.

The userspace application tying into ffmpeg/libva/beignet is fairly mature and has been used with a baytrail system for a year without issue on yocto krogoth (kernel 4.4).

I can try to reproduce on a NUC6CAYS, but it would take weeks to confirm/deny an issue there.

Will attach /sys/class/drm/card0/error and dmesg.

One question/clarification: Is the additional linux sideloaded gpu firmware mandatory to be loaded for proper operation? I was under the impression it is optional and only required if advanced dmc power states are required.

# dmesg | grep "firmware"
[ 1.147206] i915 0000:00:02.0: Direct firmware load for i915/bxt_dmc_ver1_07.bin failed with error -2
[ 1.147228] i915 0000:00:02.0: Failed to load DMC firmware [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling runtime power management.
[ 2.130898] [drm] GuC firmware load skipped
Created attachment 139153 [details] /sys/class/drm/card0/error
Created attachment 139155 [details] dmesg
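Because the hang only shows up after weeks of runtime across a pool of machines, it may help to save the error state automatically as soon as one is present. The following is a minimal sketch, not part of the original report: the function names and the /var/log/gpu-hang destination are hypothetical, and it assumes the sysfs file reports "no error state collected" (matched case-insensitively) when nothing has been captured.

```python
import time
from pathlib import Path


def has_error_state(text):
    """True if the sysfs error file appears to hold a captured GPU error state.

    Assumption: with nothing captured, i915 writes a line containing
    "no error state collected" (capitalisation differs between kernels).
    """
    stripped = text.strip()
    return bool(stripped) and "no error state collected" not in stripped.lower()


def snapshot_error_state(src="/sys/class/drm/card0/error",
                         dest_dir="/var/log/gpu-hang"):
    """Copy the error state to a timestamped file; return its path or None.

    dest_dir is a hypothetical location chosen for illustration.
    """
    text = Path(src).read_text(errors="replace")
    if not has_error_state(text):
        return None
    out_dir = Path(dest_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out = out_dir / time.strftime("i915-error-%Y%m%d-%H%M%S.txt")
    out.write_text(text)
    return out
```

Calling snapshot_error_state() periodically (for example from a cron job or a systemd timer) would leave a dated copy of the error state on any unit that hangs, ready to attach to the bug.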
Can you first check that you have the FW in the right place, as you seem to be trying to use the latest from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915. The firmware files used by our i915 driver should be in /lib/firmware/i915. If this fails, try using the FW with the latest drm-tip: https://cgit.freedesktop.org/drm-tip.
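One way to check the firmware location is to take the failed-load lines from dmesg and see whether the named files exist under the firmware directory. This is a rough sketch with hypothetical function names; it assumes the kernel message format shown in the report ("Direct firmware load for ... failed with error -2", where -2 is ENOENT, i.e. file not found) and the default /lib/firmware root.

```python
import re
from pathlib import Path

# Matches kernel lines such as:
#   i915 0000:00:02.0: Direct firmware load for i915/bxt_dmc_ver1_07.bin failed with error -2
FAILED_LOAD = re.compile(r"Direct firmware load for (\S+) failed with error (-?\d+)")


def failed_firmware_loads(dmesg_text):
    """Return (relative_path, error_code) for each failed direct firmware load."""
    return [(m.group(1), int(m.group(2))) for m in FAILED_LOAD.finditer(dmesg_text)]


def firmware_report(dmesg_text, firmware_root="/lib/firmware"):
    """Map each firmware file the kernel failed to load to whether it exists on disk."""
    root = Path(firmware_root)
    return {rel: (root / rel).is_file()
            for rel, _ in failed_firmware_loads(dmesg_text)}
```

Feeding it the output of dmesg gives a dictionary such as {"i915/bxt_dmc_ver1_07.bin": False} when the file is missing. A True value alongside a failed load would suggest the file was not visible at driver load time (for example, not included in the initramfs) rather than absent from the root filesystem.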
Looking at the error state, the hardware never started to actually execute the request on port[0]; it is resting on a previous head/tail. Then we do a reset and try a recovery, and it looks like the recovery finds port[0] to be empty, which might indicate that the port request was finally processed by the hw. Reproducing this with drm-tip might give us a better error state.
Patrick, are you able to test drm-tip?
I can test drm-tip on a small number of units, but given the frequency of occurrence on 20 units, it would take about 6 months of running on 1-2 units to really have any confidence that the issue is gone. On the 20-unit pool we are only able to run release-candidate images, so drm-tip is not stable enough to be appropriate.

On the 20 units, we will begin testing with kernel 4.12 (yocto rocko 2.4.2) and ensure the bxt i915 firmwares are loaded to see if the problem goes away.

What is the official intel-gfx position on enable_rc6=0 and/or max_cstate=1 ? Should we be using these if the goal is maximum stability?

Are there any particular commits in drm-tip that could explain this issue that we should consider backporting into 4.10 or 4.12?
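The "about 6 months" figure can be sanity-checked with a simple model. This is only a sketch: it assumes hangs arrive as a Poisson process at a constant per-unit rate, and the exposure of 120 unit-weeks used in the usage note (20 units times 6 weeks) is an illustrative guess, not a number from this report. The function names are hypothetical.

```python
import math


def hang_rate(hangs, unit_weeks):
    """Observed hangs per unit per week."""
    return hangs / unit_weeks


def weeks_without_hang(rate, units, confidence=0.95):
    """Hang-free weeks needed on `units` machines to reject the old rate.

    Under a Poisson model, P(zero hangs in t weeks) = exp(-rate * units * t).
    Solving exp(-rate * units * t) = 1 - confidence for t gives the result.
    """
    return -math.log(1.0 - confidence) / (rate * units)
```

With the assumed exposure, weeks_without_hang(hang_rate(8, 120), units=2) comes out at roughly 22 weeks for a 2-unit test, while the same calculation for all 20 units gives a little over 2 weeks. The real exposure time should be substituted before drawing conclusions.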
Mika, do you know if there are any fixes in drm-tip, as asked in the latest comment?
(In reply to Patrick Beaulieu from comment #6)
> What is the official intel-gfx position on enable_rc6=0 and/or max_cstate=1 ?
> Should we be using these if the goal is maximum stability?

enable_rc6 parameter has been removed upstream. In general, I can't recommend anyone running non-default i915 parameters. Most of them are for debugging, taint the kernel, and aren't properly tested. If you have issues, we'll only support the default settings.
(In reply to Jani Nikula from comment #8)
> (In reply to Patrick Beaulieu from comment #6)
> > What is the official intel-gfx position on enable_rc6=0 and/or max_cstate=1 ?
> > Should we be using these if the goal is maximum stability?
>
> enable_rc6 parameter has been removed upstream. In general, I can't
> recommend anyone running non-default i915 parameters. Most of them are for
> debugging, taint the kernel, and aren't properly tested. If you have issues,
> we'll only support the default settings.

How do you align that with errata from Intel for certain platforms (baytrail, possibly apollolake/broxton as well) that basically say "using low power states may hang your system due to hardware issue, we recommend max C1 and disable rc6"?

I won't paste the exact text since the doc is supposed to be confidential, but somebody has it up for reading:
http://advci.eastasia.cloudapp.azure.com/wordpress/wp-content/uploads/2017/05/570005_Intel_Celeron_Processor_J1900_Sighting_Alert_4995585_Rev1_0.pdf#page=4

There are also numerous bugs/hangs filed in this project where the suggested and functioning workaround is enable_rc6=0.

I see some are already complaining about enable_rc6 being taken away:
https://bugs.freedesktop.org/show_bug.cgi?id=105962
Patrick, have you had any luck testing the latest drm-tip? Please use https://cgit.freedesktop.org/drm-tip and send dmesg captured with drm.debug=0x1e log_buf_len=4M.
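Before collecting the log, it can be worth confirming that the running kernel was actually booted with the requested parameters. A minimal sketch, with hypothetical helper names, that inspects a kernel command line string (for example the contents of /proc/cmdline); it splits on whitespace only, so quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_debug_params(cmdline, wanted=None):
    """Return the requested parameters that are absent or set to another value."""
    if wanted is None:
        # Values requested in this thread for the debug log.
        wanted = {"drm.debug": "0x1e", "log_buf_len": "4M"}
    params = parse_cmdline(cmdline)
    return {key: value for key, value in wanted.items() if params.get(key) != value}
```

On the target, missing_debug_params(open("/proc/cmdline").read()) returns an empty dictionary when both settings are in effect, and otherwise lists what still needs to be added to the bootloader configuration.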
ping, if no feedback I guess we need to close this.
Closing as warned.