Bug 97872 - GPU hang with libva (gstreamer)
Summary: GPU hang with libva (gstreamer)
Status: CLOSED FIXED
Alias: None
Product: libva
Classification: Unclassified
Component: intel
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: haihao
QA Contact: Sean V Kelley
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-09-20 13:12 UTC by Florent Thiéry
Modified: 2016-10-28 20:03 UTC
CC List: 3 users

See Also:
i915 platform: SKL
i915 features: GPU hang


Attachments
gpu crash dump (3.06 MB, text/plain)
2016-09-20 13:12 UTC, Florent Thiéry
A patch to use BSD0 ring (1.58 KB, patch)
2016-10-25 01:06 UTC, haihao
patched PKGFILE for libva-intel-driver on Arch (1.41 KB, application/gzip)
2016-10-25 08:24 UTC, Florent Thiéry
attachment-31724-0.html (3.62 KB, text/html)
2016-10-26 06:44 UTC, Sean V Kelley

Description Florent Thiéry 2016-09-20 13:12:43 UTC
Created attachment 126659 [details]
gpu crash dump

hw platforms:
- Skylake i7 NUC6i7KYK (GPU Iris Pro 580)
- Skylake i5 NUC6i5SYK (GPU Iris Graphics 540)

The problem does not happen with Skylake i3 NUC6i3SYL (GPU HD Graphics 520). On the affected machines it happens both headless and under Xorg.

libdrm: 2.4.70

Running the following command twice in a row (quickly) results in a GPU reset (gstreamer master required):

gst-launch-1.0 videotestsrc num-buffers=100 ! video/x-raw\,\ format\=\(string\)I420\,\ width\=\(int\)1920\,\ height\=\(int\)1080\,\ framerate\=\(fraction\)30/1 ! vaapih264enc tune=low-power ! fakesink silent=false -v

[  613.290101] [drm] stuck on bsd2 ring
[  613.290989] [drm] GPU HANG: ecode 9:3:0xcb79ffc4, in videotestsrc0:s [2622], reason: Engine(s) hung, action: reset
[  613.290992] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  613.290995] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  613.290997] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  613.290999] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  613.291001] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  613.293290] drm/i915: Resetting chip after gpu hang
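
For anyone triaging similar reports, here is a minimal sketch of pulling the useful fields out of the kernel log lines above. The regular expressions are modelled only on the two lines quoted in this report, so other kernel versions may need different patterns; the function name `parse_hang` is hypothetical.

```python
import re

# Patterns modelled on the "[drm] stuck on ..." and "[drm] GPU HANG" lines
# quoted in this report; other kernels may format these messages differently.
STUCK_RE = re.compile(r"stuck on (?P<ring>\S+) ring")
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>\S+), "
    r"in (?P<comm>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_hang(lines):
    """Return a dict for the first GPU hang in dmesg-style lines, or None.

    If a "stuck on <ring> ring" line appears before the hang line, the ring
    name is included under the "ring" key.
    """
    info = {}
    for line in lines:
        stuck = STUCK_RE.search(line)
        if stuck:
            info["ring"] = stuck.group("ring")
        hang = HANG_RE.search(line)
        if hang:
            info.update(hang.groupdict())
            info["pid"] = int(info["pid"])
            return info
    return None


if __name__ == "__main__":
    sample = [
        "[  613.290101] [drm] stuck on bsd2 ring",
        "[  613.290989] [drm] GPU HANG: ecode 9:3:0xcb79ffc4, in videotestsrc0:s "
        "[2622], reason: Engine(s) hung, action: reset",
    ]
    print(parse_hang(sample))
```

Feeding it the output of `dmesg` split into lines should be enough to see which ring and which process were involved before attaching the crash dump.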
Comment 1 Florent Thiéry 2016-09-20 13:13:37 UTC
Btw, running 4.7.4-1-ARCH
Comment 2 Sean V Kelley 2016-10-13 19:08:45 UTC
Unfortunately, this bug was created under the graphics product.  It should have been created for VAAPI instead.

I'm going to move it.
Comment 3 Sean V Kelley 2016-10-13 19:10:54 UTC
@haihao, let's take a look.
Comment 4 Sean V Kelley 2016-10-24 17:12:07 UTC
Have you also tested with 4.8?

Thanks,

Sean
Comment 5 Florent Thiéry 2016-10-24 17:17:21 UTC
Yes, just tested on Linux nuc6i5 4.8.4-1-ARCH #1 SMP PREEMPT Sat Oct 22 18:26:57 CEST 2016 x86_64 GNU/Linux with gstreamer-vaapi 1.9.90+1+g9414815-1
Comment 6 Josep Torra 2016-10-24 21:25:21 UTC
(In reply to Sean V Kelley from comment #4)
> Have you also tested with 4.8?
> 
> Thanks,
> 
> Sean

Hi Sean,

I'm also reproducing this issue with the same hardware.

I spent some time on it 3 weeks ago and was able to reproduce it on Ubuntu 16.04 with the graphics stack updated via the padoka PPA [1] and updated kernels from mainline [2]. I tried the latest 4.8 RC at the time and drm-intel-nightly.

I'm planning to spend some time on this issue by the end of this week. Please let me know which kernel you want me to install on the system.

[1] https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa?field.series_filter=xenial
[2] http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/
Comment 7 haihao 2016-10-25 01:06:57 UTC
Created attachment 127525 [details] [review]
A patch to use BSD0 ring

Could you have a try with the attached patch?
Comment 8 Florent Thiéry 2016-10-25 08:24:18 UTC
Works like a charm on GT3e. If anyone is interested in testing on Arch, I attached the corresponding patched libva-intel-driver PKGFILE.

Josep, if you can test on GT4e that would be nice (I don't have access to this hw anymore).
Comment 9 Florent Thiéry 2016-10-25 08:24:44 UTC
Created attachment 127534 [details]
patched PKGFILE for libva-intel-driver on Arch
Comment 10 Florent Thiéry 2016-10-25 10:52:48 UTC
Not seeing any regression on GT2 either
Comment 11 sreerenj 2016-10-25 11:02:19 UTC
(In reply to Florent Thiéry from comment #8)
> Works like a charm on GT3e; if anyone interested to test on Arch, i attached
> the corresponding patched libva-intel-driver PKGFILE.
> 
Great :)
Comment 12 Josep Torra 2016-10-25 11:09:06 UTC
I tried it and it fixes the issue on the skull canyon too.

Please could you explain the change?

Does it introduce any performance penalty?

Are both MFX/FF units in GT3/GT4 still used with this change?
Comment 13 Sean V Kelley 2016-10-25 17:03:43 UTC
Haihao,

The 2nd VDBOX on SKL is not a complete VDBOX, it only contains MFX. I would recommend shunting all MFX workloads to the 2nd VDBOX and using the 1st VDBOX for HCP, VDENC, HuC. What we need for a permanent fix is an i915 kernel patch that manages the loads between the engines based on input from UMD.

So, while I find your patch serviceable for the immediate need, I want to see a long-term fix along the lines I suggested above.  I will look into the kernel patch.

Thanks,

Sean
Comment 14 Sean V Kelley 2016-10-25 18:14:27 UTC
We also need to evaluate with KBL GT3+.  

Sean
Comment 15 Sean V Kelley 2016-10-25 20:57:34 UTC
The other reason I'm not too keen on this patch is that it is a band-aid: we would have to override and use the flag every time we add a new feature that is not shared between the two VDBoxen.
Comment 16 haihao 2016-10-26 05:49:16 UTC
(In reply to Sean V Kelley from comment #13)
> Haihao,
> 
> The 2nd VDBOX on SKL is not a complete VDBOX, it only contains MFX. I would
> recommend shunting all MFX workloads to the 2nd VDBOX and using the 1st
> VDBOX for HCP, VDENC, HuC.

Usually a user doesn't use different codecs at the same time. I don't think using BSD1 only for MFX is a better choice for VP8/H264/MPEG2 etc.

> What we need for a permanent fix is an i915
> kernel patch that manages the loads between the engines based on input from
> UMD.

Currently the i915 kernel driver can manage the loads between the engines, but it doesn't know that HCP/HuC commands must be run from the 2nd ring unless the UMD driver tells it, which is why I915_EXEC_BSD_RING1 and I915_EXEC_BSD_RING2 were added to the execution ioctl.
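
To illustrate the flags mentioned above, here is a rough sketch of how the BSD ring selector sits in the execbuffer flags word. The constant values mirror include/uapi/drm/i915_drm.h as I recall them for kernels of this period, so treat them as assumptions and check your own headers; the helper names `bsd_exec_flags` and `bsd_ring_selector` are hypothetical and not part of any library.

```python
# Assumed values, mirroring include/uapi/drm/i915_drm.h (verify locally).
I915_EXEC_BSD = 2 << 0                 # engine selector: BSD (video) engine
I915_EXEC_BSD_SHIFT = 13
I915_EXEC_BSD_MASK = 3 << I915_EXEC_BSD_SHIFT
I915_EXEC_BSD_DEFAULT = 0 << I915_EXEC_BSD_SHIFT  # kernel picks the ring
I915_EXEC_BSD_RING1 = 1 << I915_EXEC_BSD_SHIFT    # pin to the first ring
I915_EXEC_BSD_RING2 = 2 << I915_EXEC_BSD_SHIFT    # pin to the second ring

_VALID_SELECTORS = (I915_EXEC_BSD_DEFAULT, I915_EXEC_BSD_RING1, I915_EXEC_BSD_RING2)


def bsd_exec_flags(selector=I915_EXEC_BSD_DEFAULT):
    """Hypothetical helper: flags word for a batch submitted to the BSD engine."""
    if selector not in _VALID_SELECTORS:
        raise ValueError("selector must be one of the I915_EXEC_BSD_* ring values")
    return I915_EXEC_BSD | selector


def bsd_ring_selector(flags):
    """Decode the ring selector field: 0 = default, 1 = ring 1, 2 = ring 2."""
    return (flags & I915_EXEC_BSD_MASK) >> I915_EXEC_BSD_SHIFT


if __name__ == "__main__":
    for name, sel in (("default", I915_EXEC_BSD_DEFAULT),
                      ("ring1", I915_EXEC_BSD_RING1),
                      ("ring2", I915_EXEC_BSD_RING2)):
        print(name, hex(bsd_exec_flags(sel)))
```

With the default selector the kernel is free to balance between the rings; a non-default selector is the only way userspace can tell it that a particular batch has to run on a particular ring.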

> 
> So, while I find your patch serviceable for the immediate need, I want to see
> a long term fix along the lines above that I suggest.  I will look into the
> kernel patch.
> 
> Thanks,
> 
> Sean
Comment 17 haihao 2016-10-26 05:53:02 UTC
(In reply to Josep Torra from comment #12)
> I tried it and it fixes the issue on the skull canyon too.
> 
> Please could you explain the change?

The batchbuffer for VDEnc/HuC must be dispatched to the 1st VDBOX ring.

> 
> Does it introduce any performance penalty?

No. 

> Are both MFX/FF units in GT3/GT4 still used with this change?

Yes.
Comment 18 Sean V Kelley 2016-10-26 06:44:03 UTC
Created attachment 127548 [details]
attachment-31724-0.html

> On 25 Oct 2016, at 22:49, bugzilla-daemon@freedesktop.org wrote:
> 
> Comment #16 on bug 97872 <https://bugs.freedesktop.org/show_bug.cgi?id=97872> from haihao
> (In reply to Sean V Kelley from comment #13)
> > Haihao,
> > 
> > The 2nd VDBOX on SKL is not a complete VDBOX, it only contains MFX. I would
> > recommend shunting all MFX workloads to the 2nd VDBOX and using the 1st
> > VDBOX for HCP, VDENC, HuC.
> 
> Usually user doesn't use different codecs at the same time. I don't think using
> BSD1 only for MFX is better choice for VP8/H264/MPEG2 etc.

You can’t assume that.  In fact, VDENC will likely dominate. 
> > What we need for a permanent fix is an i915
> > kernel patch that manages the loads between the engines based on input from
> > UMD.
> 
> Currently i915 kernel can manage the loads between the engines.  But i915
> kernel doesn't know HCP/HuC commands must be ran from the 2nd ring unless UMD
> driver can tell the kernel. which is why I915_EXEC_BSD_RING1 and
> I915_EXEC_BSD_RING2 are added to the execution ioctl.

Yes, that is the whole point of the patch submitted to the i915, but it is still a hack.  I’m well aware of how this works.  And that means every time we use a new codec that is not balanced between the VDBoxen we add the hack flag.  Again, we need to do better.

Sean
> > 
> > So, while I find your patch serviceable for the immediate need, I want to see
> > a long term fix along the lines above that I suggest.  I will look into the
> > kernel patch.
> > 
> > Thanks,
> > 
> > Sean
Comment 19 haihao 2016-10-26 08:31:32 UTC
(In reply to Sean V Kelley from comment #14)
> We also need to evaluate with KBL GT3+.  

This patch doesn't touch any code for KBL, so we don't need to evaluate with KBL. Note VDEnc is not supported in the driver.
Comment 20 haihao 2016-10-26 08:43:09 UTC
> Yes, that is the whole point of the patch submitted to the i915, but it is
> still a hack.  I’m well aware of how this works.  And that means every time
> we use a new codec that is not balanced between the VDBoxen we add the hack
> flag.  Again, we need to do better.
> 

The batchbuffer for a new codec (MPEG2/JPEG/VC1/AVC/VP8) is still balanced between all VCS/VDBOX rings. The flag in the following function call applies only to the current batch buffer.

intel_batchbuffer_start_atomic_bcs_override(batch, 0x1000, BSD_RING0);
Comment 21 haihao 2016-10-26 08:44:41 UTC
> note VDEnc is not supported in the driver.

Sorry, I mean VDEnc for KBL is not supported.
Comment 22 Sean V Kelley 2016-10-26 16:47:08 UTC
We do support VP9 and that is a part of HCP and that needs to be tested with KBL.
Comment 23 Sean V Kelley 2016-10-26 16:48:27 UTC
And we will support AVC VDENC with KBL, and again that will need evaluation.  So please also verify with KBL GT3+ for HCP-based codecs.
Comment 24 Sean V Kelley 2016-10-26 16:49:49 UTC
(In reply to haihao from comment #20)
> > Yes, that is the whole point of the patch submitted to the i915, but it is
> > still a hack.  I’m well aware of how this works.  And that means every time
> > we use a new codec that is not balanced between the VDBoxen we add the hack
> > flag.  Again, we need to do better.
> > 
> 
> The batchbuffer for a new codec (MPEG2/JPEG/VC1/AVC/VP8) is still balanced
> between all VCS/VDBOX rings. The flag in the following function call is only
> available to the current batch buffer. 
> 
> intel_batchbuffer_start_atomic_bcs_override(batch, 0x1000, BSD_RING0);

Yes, I'm aware of that. My point is that it is not "balanced" at all from a performance standpoint, and that is the kernel work I'm describing.  Regardless, that will not impact the immediate fix for this bug.

But I need you to test with KBL GT3+ for HCP based codecs.  Once we have VDENC AVC in place that too will need to be tested.
Comment 25 haihao 2016-10-27 01:00:17 UTC
This patch touches VDEnc only. For the other HCP-based codecs (HEVC/VP9 decoding/encoding), the 1st VCS/VDBOX is already used. Note that each codec in the driver has a separate code path, which is why I don't think more testing is needed.
Comment 26 Sean V Kelley 2016-10-27 18:20:35 UTC
@haihao,

There are two issues:

1) We have to apply a band-aid every time we have a new platform with different VDBOX codec use.

2) Load is not being balanced in actuality.  We need to implement per-BB balancing. This will require KMD+libdrm+UMD changes.  The KMD and libdrm changes are trivial.  

As I said, I'm fine with this patch, but it is not the long term solution.

So I will Ack your patch if you send it to the mailing list and we can merge it.
Comment 27 haihao 2016-10-28 03:06:43 UTC
@Sean

For VDEnc, we always have to use the 1st ring, no matter whether it is GT1/GT2/GT3 or GT4, so the if () in the patch is not necessary. I will refine the patch and send it to the mailing list.

As for the issues you mentioned, they are not related to this bug. I will discuss them with you in another post.
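
As a summary of the ring rule discussed in this thread, here is a small sketch. The capability table only encodes what was said above for SKL (the 1st VDBOX is complete, the 2nd contains MFX only); the names `VDBOX_BLOCKS`, `eligible_rings` and `ring_for` are hypothetical and do not exist in the driver.

```python
# Capability map as described in this thread for SKL: ring 0 is the complete
# VDBOX, ring 1 only has MFX. This is an illustration, not driver data.
VDBOX_BLOCKS = {
    0: {"MFX", "HCP", "VDENC", "HUC"},
    1: {"MFX"},
}


def eligible_rings(block, num_vdbox=2):
    """Rings (0-based) able to run a batch that uses the given hardware block.

    num_vdbox is 1 for parts with a single VDBOX and 2 for GT3/GT4.
    """
    if num_vdbox not in (1, 2):
        raise ValueError("this sketch only models 1 or 2 VDBOXes")
    return [ring for ring in range(num_vdbox) if block in VDBOX_BLOCKS[ring]]


def ring_for(block, num_vdbox=2):
    """Return the ring a batch must be pinned to, or None to let the kernel balance."""
    rings = eligible_rings(block, num_vdbox)
    if not rings:
        raise ValueError(f"no VDBOX provides {block}")
    return rings[0] if len(rings) == 1 else None


if __name__ == "__main__":
    for block in ("MFX", "HCP", "VDENC", "HUC"):
        print(block, "GT3/GT4:", ring_for(block, 2), "single VDBOX:", ring_for(block, 1))
```

In this model VDEnc resolves to ring 0 whether there is one VDBOX or two, which is why a check on the GT level before pinning adds nothing, while MFX batches on two-VDBOX parts stay unpinned and are left to the kernel.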
Comment 28 Sean V Kelley 2016-10-28 20:03:50 UTC
I merged it.  Thanks for the change.

