106262 – [BXT] GPU HANG: ecode 9:2:0xbefffffe, in Main Thread [4018], reason: Hang on bsd ring,

Status:	CLOSED WONTFIX

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged
Keywords:

Depends on:
Blocks:

Reported:	2018-04-26 22:56 UTC by Patrick Beaulieu
Modified:	2018-08-15 06:21 UTC
CC List:	3 users

See Also:
i915 platform:	BXT
i915 features:	GPU hang

Attachments
/sys/class/drm/card0/error (57.25 KB, text/plain) 2018-04-26 22:58 UTC, Patrick Beaulieu
dmesg (47.87 KB, text/plain) 2018-04-26 23:12 UTC, Patrick Beaulieu

Description Patrick Beaulieu 2018-04-26 22:56:40 UTC

After about 4-6 weeks of continuously decoding H.264 and copying the frames back to the CPU via ffmpeg+libva+beignet, we are seeing a GPU hang. It has occurred 7-8 times, on 5 of 20 different N4200 machines.
This is a headless system with nothing else using the GPU (no display connected).
Environment:
ApolloLake/Broxton N4200
Linux Kernel 4.10 (yocto project pyro 2.3.3 with matched meta-intel)
libva 1.7.3

# vainfo
libva info: VA-API version 0.39.4
libva info: va_getDriverName() returns 0
libva info: Trying to open /usr/lib/dri/i965_drv_video.so
libva info: Found init function __vaDriverInit_0_39
libva info: va_openDriver() returns 0
vainfo: VA-API version: 0.39 (libva 1.7.3)
vainfo: Driver version: Intel i965 driver for Intel(R) Broxton - 1.7.3
vainfo: Supported profile and entrypoints
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSlice
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSlice
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSlice
      VAProfileH264MultiviewHigh      : VAEntrypointVLD
      VAProfileH264MultiviewHigh      : VAEntrypointEncSlice
      VAProfileH264StereoHigh         : VAEntrypointVLD
      VAProfileH264StereoHigh         : VAEntrypointEncSlice
      VAProfileVC1Simple              : VAEntrypointVLD
      VAProfileVC1Main                : VAEntrypointVLD
      VAProfileVC1Advanced            : VAEntrypointVLD
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileVP8Version0_3          : VAEntrypointEncSlice
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSlice
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileVP9Profile0            : VAEntrypointVLD

# uname -a
Linux N4200-test-machine 4.10.17-yocto-standard #2 SMP PREEMPT Tue Apr 24 00:12:09 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Tried cmdline i915.enable_rc6=0 and it did not help.
Will try switching to yocto rocko (kernel 4.12, new libva) but it will take a very long time to confirm if it is fixed by the upgrade since this is a pretty rare hang.

The userspace application tying into ffmpeg/libva/beignet is fairly mature and has been used with a baytrail system for a year without issue on yocto krogoth (kernel 4.4).
I can try to reproduce on a NUC6CAYS but it would take weeks to confirm/deny an issue there.

Will attach /sys/class/drm/card0/error and dmesg.

One question/clarification: Is it mandatory to load the additional sideloaded Linux GPU firmware for proper operation? I was under the impression it is optional and only needed if the advanced DMC power states are required.
# dmesg | grep "firmware"
[    1.147206] i915 0000:00:02.0: Direct firmware load for i915/bxt_dmc_ver1_07.bin failed with error -2
[    1.147228] i915 0000:00:02.0: Failed to load DMC firmware [https://01.org/linuxgraphics/intel-linux-graphics-firmwares], disabling runtime power management.
[    2.130898] [drm] GuC firmware load skipped
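For anyone checking many units for the same condition, here is a minimal sketch that pulls failed direct firmware loads out of dmesg text. The function name and the regular expression are my own and are derived only from the message format shown above; error -2 corresponds to ENOENT (file not found), which is consistent with the firmware file not being installed.

```python
import errno
import re

# Pattern derived from the dmesg line quoted above; other kernels may word it differently.
FW_FAIL = re.compile(r"Direct firmware load for (\S+) failed with error (-?\d+)")


def failed_firmware_loads(dmesg_text):
    """Return (firmware_path, error_code) pairs for each failed direct firmware load."""
    return [(m.group(1), int(m.group(2))) for m in FW_FAIL.finditer(dmesg_text)]


# Two lines copied from the dmesg output in this report.
SAMPLE = """\
[    1.147206] i915 0000:00:02.0: Direct firmware load for i915/bxt_dmc_ver1_07.bin failed with error -2
[    2.130898] [drm] GuC firmware load skipped
"""

if __name__ == "__main__":
    for path, code in failed_firmware_loads(SAMPLE):
        # Kernel error codes are negated errno values.
        name = errno.errorcode.get(abs(code), "unknown")
        print(f"{path}: error {code} ({name})")
```

This only reports what the kernel logged; it does not answer whether the firmware is required for correct operation.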

Comment 1 Patrick Beaulieu 2018-04-26 22:58:31 UTC

Created attachment 139153
/sys/class/drm/card0/error

Comment 2 Patrick Beaulieu 2018-04-26 23:12:02 UTC

Created attachment 139155
dmesg

Comment 3 Jani Saarinen 2018-04-27 06:38:14 UTC

Can you first check that you have the FW in the right place? The latest files are at https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/i915. The ones used by our i915 driver should be in /lib/firmware/i915.

If this fails try using FW with latest drm-tip: https://cgit.freedesktop.org/drm-tip.
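A small sketch of the "is the FW in the right place" check, assuming the file name from the dmesg output in the description and the /lib/firmware location mentioned above. The helper name and the `roots` parameter are hypothetical. If i915 is built into the kernel or loaded from an initramfs, the file would presumably also have to be available at that point, which this check does not cover.

```python
from pathlib import Path


def firmware_present(relative_name, roots=("/lib/firmware",)):
    """Return the first existing path for relative_name under roots, or None."""
    for root in roots:
        candidate = Path(root) / relative_name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # File name taken from the failed load in the reporter's dmesg.
    name = "i915/bxt_dmc_ver1_07.bin"
    found = firmware_present(name)
    print(f"{name}: {'found at ' + str(found) if found else 'missing'}")
```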

Comment 4 Mika Kuoppala 2018-04-27 09:52:50 UTC

Looking at the error state, the hardware never actually started to
execute the request on port[0]; it is still resting on a previous head/tail.

Then we do a reset and try a recovery. It looks like the recovery finds port[0] to be empty, which might indicate that the port request was finally processed by the hw.

Reproducing this with drm-tip might give us better error state.

Comment 5 Jani Saarinen 2018-05-02 06:49:49 UTC

Patrick, are you able to test drm-tip?

Comment 6 Patrick Beaulieu 2018-05-02 21:32:19 UTC

I can test drm-tip on a small number of units, but given the frequency of occurrence on 20 units, it would take about 6 months of running on 1-2 units to have any real confidence that the issue is gone.
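The reasoning behind that estimate can be sketched as follows, under the assumption (mine, not measured) that hangs arrive independently at a constant rate per unit, i.e. a Poisson process. With rate r per unit-week, the chance of seeing zero hangs in T unit-weeks is exp(-r*T), so the hang-free run needed for a given confidence is T = -ln(1 - confidence) / r. The rate used in the example is hypothetical and must be replaced with a real figure from the fleet logs.

```python
import math


def unit_weeks_for_confidence(rate_per_unit_week, confidence=0.95):
    """Hang-free unit-weeks needed so that P(no hang at the old rate) <= 1 - confidence."""
    if rate_per_unit_week <= 0 or not 0 < confidence < 1:
        raise ValueError("rate must be positive and confidence must be in (0, 1)")
    return -math.log(1.0 - confidence) / rate_per_unit_week


def calendar_weeks(rate_per_unit_week, units, confidence=0.95):
    """Spread the required unit-weeks across `units` machines running in parallel."""
    if units <= 0:
        raise ValueError("units must be positive")
    return unit_weeks_for_confidence(rate_per_unit_week, confidence) / units


if __name__ == "__main__":
    hypothetical_rate = 0.02  # hangs per unit-week; placeholder, not taken from this report
    for units in (1, 2, 20):
        weeks = calendar_weeks(hypothetical_rate, units)
        print(f"{units} unit(s): {weeks:.1f} weeks without a hang")
```

The required time scales inversely with the number of units, which is why a 1-2 unit test takes so much longer than a run on the 20-unit pool.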

On the 20 unit pool we only are able to run release-candidate images, so drm-tip is not stable enough to be appropriate. On the 20 units, we will begin testing with kernel 4.12 (yocto rocko 2.4.2) and ensure the bxt i915 firmwares are loaded to see if the problem goes away.
What is the official intel-gfx position on enable_rc6=0 and/or max_cstate=1 ?
Should we be using these if the goal is maximum stability?

Are there any particular commits in drm-tip that could explain this issue that we should consider backporting into 4.10 or 4.12?

Comment 7 Jani Saarinen 2018-05-04 07:15:17 UTC

Mika, do you know if there are any fixes in drm-tip, as asked in the latest comment?

Comment 8 Jani Nikula 2018-05-08 08:35:53 UTC

(In reply to Patrick Beaulieu from comment #6)
> What is the official intel-gfx position on enable_rc6=0 and/or max_cstate=1 ?
> Should we be using these if the goal is maximum stability?

enable_rc6 parameter has been removed upstream. In general, I can't recommend anyone running non-default i915 parameters. Most of them are for debugging, taint the kernel, and aren't properly tested. If you have issues, we'll only support the default settings.

Comment 9 Patrick Beaulieu 2018-05-10 18:31:57 UTC

(In reply to Jani Nikula from comment #8)
> (In reply to Patrick Beaulieu from comment #6)
> > What is the official intel-gfx position on enable_rc6=0 and/or max_cstate=1 ?
> > Should we be using these if the goal is maximum stability?
> 
> enable_rc6 parameter has been removed upstream. In general, I can't
> recommend anyone running non-default i915 parameters. Most of them are for
> debugging, taint the kernel, and aren't properly tested. If you have issues,
> we'll only support the default settings.


How do you align that with errata from Intel for certain platforms (baytrail, possibly apollolake/broxton as well) that basically say "using low power states may hang your system due to hardware issue, we recommend max C1 and disable rc6"?

I won't paste the exact text since the doc is supposed to be confidential, but somebody has it up for reading:
http://advci.eastasia.cloudapp.azure.com/wordpress/wp-content/uploads/2017/05/570005_Intel_Celeron_Processor_J1900_Sighting_Alert_4995585_Rev1_0.pdf#page=4

There are also numerous bugs/hangs filed in this project where the suggested and functioning workaround is enable_rc6=0

I see some are complaining about enable_rc6 being taken away already:
https://bugs.freedesktop.org/show_bug.cgi?id=105962

Comment 10 Jani Saarinen 2018-06-25 09:54:00 UTC

Patrick, have you had any luck testing the latest drm-tip?

Please use https://cgit.freedesktop.org/drm-tip and send dmesg from a boot with drm.debug=0x1e log_buf_len=4M.
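Before collecting the log, it may be worth confirming that the running kernel was actually booted with the requested parameters. A minimal sketch, assuming a simple whitespace-separated command line (quoted values containing spaces are not handled); the helper names are hypothetical.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


# The two parameters requested in this comment.
REQUESTED = {"drm.debug": "0x1e", "log_buf_len": "4M"}


def missing_debug_params(cmdline, requested=REQUESTED):
    """Return the requested parameters that are absent or set to a different value."""
    current = parse_cmdline(cmdline)
    return {k: v for k, v in requested.items() if current.get(k) != v}


if __name__ == "__main__":
    # Illustrative command line; on a live system pass the contents of /proc/cmdline.
    sample = "root=/dev/sda2 quiet drm.debug=0x1e"
    print(missing_debug_params(sample))
```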

Comment 11 Jani Saarinen 2018-08-13 09:31:57 UTC

Ping. If there is no feedback, I guess we need to close this.

Comment 12 Jani Saarinen 2018-08-15 06:21:15 UTC

Closing as warned.
