Summary: | GPU hang with libva (gstreamer) | ||
---|---|---|---|
Product: | libva | Reporter: | Florent Thiéry <florent.thiery> |
Component: | intel | Assignee: | haihao <haihao.xiang> |
Status: | CLOSED FIXED | QA Contact: | Sean V Kelley <seanvk> |
Severity: | normal | ||
Priority: | medium | CC: | bsreerenj, intel-gfx-bugs, n770galaxy |
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | SKL | i915 features: | GPU hang |
Attachments: |
gpu crash dump
A patch to use BSD0 ring
patched PKGFILE for libva-intel-driver on Arch
attachment-31724-0.html |
Btw, running 4.7.4-1-ARCH

Unfortunately, this bug was created under the graphics product. It should have been created for VAAPI instead. I'm going to move it.

@haihao, let's take a look.

Have you also tested with 4.8?

Thanks,

Sean

Yes, just tested on Linux nuc6i5 4.8.4-1-ARCH #1 SMP PREEMPT Sat Oct 22 18:26:57 CEST 2016 x86_64 GNU/Linux with gstreamer-vaapi 1.9.90+1+g9414815-1

(In reply to Sean V Kelley from comment #4)
> Have you also tested with 4.8?
>
> Thanks,
>
> Sean

Hi Sean,

I'm also reproducing this issue with the same hardware. I spent some time on it 3 weeks ago and was able to reproduce it on Ubuntu 16.04 with the graphics stack updated via the padoka PPA [1] and updated kernels from mainline [2]. I tried the latest 4.8 RC at the time and drm-intel-nightly.

I'm planning to spend some time on this issue by the end of this week. Please let me know which kernel you would like me to install on the system.

[1] https://launchpad.net/~paulo-miguel-dias/+archive/ubuntu/mesa?field.series_filter=xenial
[2] http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-intel-nightly/

Created attachment 127525 [details] [review]
A patch to use BSD0 ring

Could you have a try with the attached patch?

Works like a charm on GT3e; if anyone is interested in testing on Arch, I attached the corresponding patched libva-intel-driver PKGFILE.

Josep, if you can test on GT4e that would be nice (I don't have access to this hw anymore).

Created attachment 127534 [details]
patched PKGFILE for libva-intel-driver on Arch
Not seeing any regression on GT2 either

(In reply to Florent Thiéry from comment #8)
> Works like a charm on GT3e; if anyone is interested in testing on Arch, I attached
> the corresponding patched libva-intel-driver PKGFILE.
>

Great :)

I tried it and it fixes the issue on the Skull Canyon too.

Please could you explain the change?

Does it introduce any performance penalty?

Are both MFX/FF units in GT3/GT4 still used with this change?

Haihao,

The 2nd VDBOX on SKL is not a complete VDBOX; it only contains MFX. I would recommend shunting all MFX workloads to the 2nd VDBOX and using the 1st VDBOX for HCP, VDENC, HuC. What we need for a permanent fix is an i915 kernel patch that manages the loads between the engines based on input from UMD.

So, while I find your patch serviceable for the immediate need, I want to see a long-term fix along the lines that I suggest above. I will look into the kernel patch.

Thanks,

Sean

We also need to evaluate with KBL GT3+.

Sean

The other reason I'm not too keen on this patch is that it is a band-aid: we have to override and use the flag every time we add a new feature that is not shared between the two VDboxen.

(In reply to Sean V Kelley from comment #13)
> Haihao,
>
> The 2nd VDBOX on SKL is not a complete VDBOX; it only contains MFX. I would
> recommend shunting all MFX workloads to the 2nd VDBOX and using the 1st
> VDBOX for HCP, VDENC, HuC.

Usually a user doesn't use different codecs at the same time. I don't think using BSD1 only for MFX is a better choice for VP8/H264/MPEG2 etc.

> What we need for a permanent fix is an i915
> kernel patch that manages the loads between the engines based on input from
> UMD.

Currently the i915 kernel can manage the loads between the engines. But the i915 kernel doesn't know that HCP/HuC commands must be run from the 2nd ring unless the UMD driver tells the kernel, which is why I915_EXEC_BSD_RING1 and I915_EXEC_BSD_RING2 were added to the execution ioctl.

>
> So, while I find your patch serviceable for the immediate need, I want to see
> a long-term fix along the lines that I suggest above. I will look into the
> kernel patch.
>
> Thanks,
>
> Sean

(In reply to Josep Torra from comment #12)
> I tried it and it fixes the issue on the Skull Canyon too.
>
> Please could you explain the change?

The batchbuffer for VDEnc/HuC must be dispatched to the 1st VDBOX ring.

>
> Does it introduce any performance penalty?

No.

> Are both MFX/FF units in GT3/GT4 still used with this change?

Yes.

Created attachment 127548 [details]
attachment-31724-0.html

> On 25 Oct 2016, at 22:49, bugzilla-daemon@freedesktop.org wrote:
>
> Comment #16 on bug 97872 (https://bugs.freedesktop.org/show_bug.cgi?id=97872#c16) from haihao
> (In reply to Sean V Kelley from comment #13)
> > Haihao,
> >
> > The 2nd VDBOX on SKL is not a complete VDBOX; it only contains MFX. I would
> > recommend shunting all MFX workloads to the 2nd VDBOX and using the 1st
> > VDBOX for HCP, VDENC, HuC.
>
> Usually a user doesn't use different codecs at the same time. I don't think using
> BSD1 only for MFX is a better choice for VP8/H264/MPEG2 etc.

You can’t assume that. In fact, VDENC will likely dominate.

> > What we need for a permanent fix is an i915
> > kernel patch that manages the loads between the engines based on input from
> > UMD.
>
> Currently the i915 kernel can manage the loads between the engines. But the i915
> kernel doesn't know that HCP/HuC commands must be run from the 2nd ring unless the UMD
> driver tells the kernel, which is why I915_EXEC_BSD_RING1 and
> I915_EXEC_BSD_RING2 were added to the execution ioctl.

Yes, that is the whole point of the patch submitted to the i915, but it is still a hack. I’m well aware of how this works. And that means every time we use a new codec that is not balanced between the VDBoxen we add the hack flag. Again, we need to do better.

Sean

> >
> > So, while I find your patch serviceable for the immediate need, I want to see
> > a long-term fix along the lines that I suggest above. I will look into the
> > kernel patch.
> >
> > Thanks,
> >
> > Sean

(In reply to Sean V Kelley from comment #14)
> We also need to evaluate with KBL GT3+.

This patch doesn't touch any code for KBL, so we don't need to evaluate with KBL. Note VDEnc is not supported in the driver.
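For readers following the flag discussion above: the I915_EXEC_BSD_RING1 and I915_EXEC_BSD_RING2 values occupy a two-bit field in the execbuffer flags word, next to the engine selector. The sketch below only illustrates that bit layout. The constants are copied by hand from the kernel's i915_drm.h uAPI header so that it builds without kernel headers, and the two helper functions are hypothetical names, not part of libdrm or the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirrored from the kernel uAPI header i915_drm.h. A real driver
 * would include that header instead of redefining them. */
#define I915_EXEC_BSD         (2 << 0)
#define I915_EXEC_BSD_SHIFT   (13)
#define I915_EXEC_BSD_MASK    (3 << I915_EXEC_BSD_SHIFT)
#define I915_EXEC_BSD_DEFAULT (0 << I915_EXEC_BSD_SHIFT)
#define I915_EXEC_BSD_RING1   (1 << I915_EXEC_BSD_SHIFT)
#define I915_EXEC_BSD_RING2   (2 << I915_EXEC_BSD_SHIFT)

/* Hypothetical helper: build execbuffer flags for a video (BSD) batch.
 * ring == 0 leaves the choice of ring to the kernel,
 * ring == 1 or 2 pins the batch to that ring. */
uint64_t bsd_exec_flags(unsigned ring)
{
    assert(ring <= 2);
    return (uint64_t)I915_EXEC_BSD | ((uint64_t)ring << I915_EXEC_BSD_SHIFT);
}

/* Hypothetical helper: recover the ring selector from a flags word. */
unsigned bsd_ring_selector(uint64_t flags)
{
    return (unsigned)((flags & I915_EXEC_BSD_MASK) >> I915_EXEC_BSD_SHIFT);
}
```

With these definitions, `bsd_exec_flags(0)` is plain `I915_EXEC_BSD` (kernel-balanced), while `bsd_exec_flags(1)` adds `I915_EXEC_BSD_RING1` and overrides the kernel's choice for that submission.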
> Yes, that is the whole point of the patch submitted to the i915, but it is
> still a hack. I’m well aware of how this works. And that means every time
> we use a new codec that is not balanced between the VDBoxen we add the hack
> flag. Again, we need to do better.
>
The batchbuffer for a new codec (MPEG2/JPEG/VC1/AVC/VP8) is still balanced between all VCS/VDBOX rings. The flag in the following function call is only available to the current batch buffer.
intel_batchbuffer_start_atomic_bcs_override(batch, 0x1000, BSD_RING0);
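The point that the override only affects the current batch buffer can be modelled in a few lines. This is a toy sketch, assuming the override is stored per batch and cleared on submission. The struct, its fields, and the batch_* function names are hypothetical and much reduced, and they do not reproduce the libva-intel-driver code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical ring selector, named after the BSD_RING0 value in the call above. */
typedef enum { BSD_DEFAULT, BSD_RING0, BSD_RING1 } bsd_ring_flag;

/* Hypothetical, much-reduced batch buffer: only what the sketch needs. */
struct batch {
    bsd_ring_flag ring; /* ring requested for the batch being built */
    int atomic;         /* non-zero while a batch is being built */
};

/* Begin a batch that must run on a specific ring. */
void batch_start_atomic_bcs_override(struct batch *b, bsd_ring_flag override)
{
    assert(b != NULL && !b->atomic);
    b->ring = override;
    b->atomic = 1;
}

/* Begin a batch with no ring preference (left to the kernel to balance). */
void batch_start_atomic_bcs(struct batch *b)
{
    batch_start_atomic_bcs_override(b, BSD_DEFAULT);
}

/* Submit the batch: report the ring it asked for, then forget the override
 * so the next batch starts from the default again. */
bsd_ring_flag batch_flush(struct batch *b)
{
    bsd_ring_flag used = b->ring;
    b->ring = BSD_DEFAULT;
    b->atomic = 0;
    return used;
}
```

In this model a VDEnc batch started with the override is flushed with BSD_RING0, and the next batch started without it goes back to BSD_DEFAULT, which is the behaviour the comment describes for the other codecs.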
> note VDEnc is not supported in the driver.
Sorry, I mean VDEnc for KBL is not supported.
We do support VP9, which is part of HCP, and that needs to be tested with KBL. And we will support AVC VDENC with KBL, and again that will need evaluation. So please also verify with KBL GT3+ for HCP-based codecs.

(In reply to haihao from comment #20)
> > Yes, that is the whole point of the patch submitted to the i915, but it is
> > still a hack. I’m well aware of how this works. And that means every time
> > we use a new codec that is not balanced between the VDBoxen we add the hack
> > flag. Again, we need to do better.
> >
>
> The batchbuffer for a new codec (MPEG2/JPEG/VC1/AVC/VP8) is still balanced
> between all VCS/VDBOX rings. The flag in the following function call is only
> available to the current batch buffer.
>
> intel_batchbuffer_start_atomic_bcs_override(batch, 0x1000, BSD_RING0);

Yes, I'm aware of that. My point is that it is not "balanced" at all from a performance standpoint. And that is the kernel work I'm describing. Regardless, that will not impact the immediate fix on this bug. But I need you to test with KBL GT3+ for HCP-based codecs. Once we have VDENC AVC in place, that too will need to be tested.

This patch touches vdenc only; for other HCP-based codecs (HEVC/VP9 decoding/encoding), the 1st VCS/VDBOX is already used. Note each codec in the driver has a separate code path, which is why I don't think more testing is needed.

@haihao,

There are two issues:

1) We have to band-aid every time we have a new platform with different VDBOX codec use.

2) Load is not being balanced in actuality. We need to implement per-BB balancing. This will require KMD+libdrm+UMD changes. The KMD and libdrm change is trivial.

As I said, I'm fine with this patch, but it is not the long-term solution. So I will Ack your patch if you send it to the mailing list and we can merge it.
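The per-BB balancing idea above can be illustrated with a small selection routine: each batch states which hardware units it needs, and the scheduler picks the least-loaded engine that has all of them. This is only a sketch of the concept under the capability split described in this thread (1st VDBOX complete, 2nd VDBOX with MFX only). The CAP_* bits, struct vdbox, and pick_vdbox are hypothetical names and do not correspond to any i915 or libdrm interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability bits for the units named in this thread. */
enum {
    CAP_MFX   = 1 << 0,
    CAP_HCP   = 1 << 1,
    CAP_VDENC = 1 << 2,
    CAP_HUC   = 1 << 3
};

/* Hypothetical per-engine state: what it can run and how busy it is. */
struct vdbox {
    unsigned caps;   /* OR of CAP_* bits the engine implements */
    unsigned queued; /* batches currently queued on the engine */
};

/* Return the index of the least-loaded engine that implements every bit in
 * `need`, or -1 if no engine qualifies. Ties go to the lower index. */
int pick_vdbox(const struct vdbox *engines, size_t count, unsigned need)
{
    int best = -1;
    for (size_t i = 0; i < count; i++) {
        if ((engines[i].caps & need) != need)
            continue; /* engine lacks a required unit */
        if (best < 0 || engines[i].queued < engines[best].queued)
            best = (int)i;
    }
    return best;
}
```

With a table such as `{ {CAP_MFX | CAP_HCP | CAP_VDENC | CAP_HUC, 3}, {CAP_MFX, 1} }`, a VDENC batch always lands on engine 0, while an MFX batch goes to whichever engine has the shorter queue, so no per-codec override flag is needed in the user-mode driver.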
@Sean

For vdenc, we always have to use the 1st ring, no matter whether it is GT1/GT2/GT3 or GT4, so the if () in the patch is not necessary. I will refine the patch and send it to the mailing list.

As for the issues you mentioned, they are not related to this bug; I will discuss them with you in another post.

I merged it. Thanks for the change. |
Created attachment 126659 [details]
gpu crash dump

hw platforms:
- Skylake i7 NUC6i7KYK (GPU Iris Pro 580)
- Skylake i5 NUC6i5SYK (GPU Iris Graphics 540)

The problem does not happen with Skylake i3 NUC6i3SYL (GPU Iris HD 520); it happens both headless and under xorg

libdrm: 2.4.70

Running the following command twice in a row (quickly) results in a GPU reset (gstreamer master required):

gst-launch-1.0 videotestsrc num-buffers=100 ! video/x-raw\,\ format\=\(string\)I420\,\ width\=\(int\)1920\,\ height\=\(int\)1080\,\ framerate\=\(fraction\)30/1 ! vaapih264enc tune=low-power ! fakesink silent=false -v

[ 613.290101] [drm] stuck on bsd2 ring
[ 613.290989] [drm] GPU HANG: ecode 9:3:0xcb79ffc4, in videotestsrc0:s [2622], reason: Engine(s) hung, action: reset
[ 613.290992] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 613.290995] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 613.290997] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 613.290999] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 613.291001] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 613.293290] drm/i915: Resetting chip after gpu hang
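When checking several machines for this hang, the ring that got stuck can be pulled out of the kernel log automatically. The sketch below assumes the message format shown in the log above ("stuck on NAME ring") and nothing more. parse_stuck_ring is a hypothetical helper, and other kernel versions may word the message differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract the ring name from a kernel log line such as
 *   "[  613.290101] [drm] stuck on bsd2 ring"
 * Writes the name into `ring` and returns 1 on success, 0 if the line does
 * not contain the message or the name does not fit in `ring_len` bytes. */
int parse_stuck_ring(const char *line, char *ring, size_t ring_len)
{
    char name[32];
    const char *p;

    assert(line != NULL && ring != NULL);

    p = strstr(line, "stuck on ");
    if (p == NULL || sscanf(p, "stuck on %31s", name) != 1)
        return 0;
    if (strlen(name) >= ring_len)
        return 0;
    strcpy(ring, name);
    return 1;
}
```

For the first line of the log above this yields "bsd2", the ring the kernel reported as stuck in this bug.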