Bug 109629

Summary: GPU hang on Haswell while encoding and decoding video
Product: DRI
Component: DRM/Intel
Status: CLOSED NOTOURBUG
Severity: normal
Priority: medium
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Whiteboard: Triaged
i915 platform: HSW
i915 features: GPU hang
Reporter: Michael Olbrich <m.olbrich>
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: chris, intel-gfx-bugs
Attachments:
- /sys/class/drm/card0/error for drm-intel-fixes-2019-02-13
- /sys/class/drm/card0/error with iommu disabled

Description Michael Olbrich 2019-02-14 11:05:05 UTC
Created attachment 143378
/sys/class/drm/card0/error for drm-intel-fixes-2019-02-13

I'm getting GPU hangs while encoding and decoding H.264 video with vaapi. I'm not sure what triggers the issue. It might be starting or stopping a stream, a resolution change, or corruption in the H.264 stream while decoding.

I've reproduced this with the following kernel versions:
- 4.20.x
- 5.0-rc6
- drm-intel-next-2019-02-07
- drm-intel-fixes-2019-02-13

It does not occur with 4.19.x and older.
Git bisect identifies 79556df293b2efbb3ccebb6db02120d62e348b44 as the first bad commit. On Haswell, this commit changes the default for ppgtt from 1 (aliasing) to 2 (full). If I set enable_ppgtt=1 on 4.20.x, the problem goes away.

I've attached the content of /sys/class/drm/card0/error for drm-intel-fixes-2019-02-13.
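
When comparing kernels, it can help to confirm which ppgtt mode is actually in effect. The following is a minimal Python sketch, not part of the original report. It assumes the i915 parameter is exposed at /sys/module/i915/parameters/enable_ppgtt, which may be missing or root-only depending on the kernel; the helper names are hypothetical, and only the two values named above (1 and 2) are labelled.

```python
from pathlib import Path

# Assumed sysfs location of the i915 module parameter; it may be absent
# or readable only by root, depending on the kernel version.
PARAM_PATH = Path("/sys/module/i915/parameters/enable_ppgtt")

# Meanings taken from the report above; other values are not labelled here.
PPGTT_MODES = {1: "aliasing", 2: "full"}


def describe_ppgtt(raw: str) -> str:
    """Translate a raw enable_ppgtt value into a readable label."""
    try:
        value = int(raw.strip())
    except ValueError:
        return f"unparseable ({raw.strip()!r})"
    return PPGTT_MODES.get(value, f"other ({value})")


def read_ppgtt(path: Path = PARAM_PATH) -> str:
    """Read the parameter file, reporting failure instead of raising."""
    try:
        return describe_ppgtt(path.read_text())
    except OSError as exc:
        return f"unavailable ({exc.__class__.__name__})"


if __name__ == "__main__":
    print("enable_ppgtt:", read_ppgtt())
```

Recording this value next to each error dump makes it easier to tell which runs used aliasing ppgtt and which used full ppgtt.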
Comment 1 Chris Wilson 2019-02-14 11:11:32 UTC
This looks like a userspace bug, though; it doesn't match the typical error expected for an invalid TLB (due to a bad mm switch). The batch was submitted by vaapi, so double check that you've pulled the latest libva, and raise a bug with them (they have historically forgotten to set up their SBA correctly, and have had use-after-free bugs in their cmdbuffers...)

One other thing to double check on the kernel side is disabling the iommu. The error state doesn't indicate that to be a problem, but it is useful to rule it out as a source of memory latency, incoherency, or a missed flush/invalidate.
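
To record whether a test run had iommu options set on the kernel command line, something like the following Python sketch could be used. It is an illustration only: the function name is hypothetical, it assumes the options of interest are `intel_iommu=` and `iommu=`, and it cannot see an iommu enabled or disabled by other means.

```python
from pathlib import Path


def iommu_options(cmdline: str) -> dict:
    """Collect iommu-related options from a kernel command line string."""
    found = {}
    for token in cmdline.split():
        key, _, value = token.partition("=")
        # Assumption: only these two option names are of interest here.
        if key in ("intel_iommu", "iommu"):
            found[key] = value
    return found


if __name__ == "__main__":
    # /proc/cmdline holds the command line the running kernel was booted with.
    print(iommu_options(Path("/proc/cmdline").read_text()))
```

An empty result only means no such option was passed at boot; it does not show whether the iommu is active.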
Comment 2 Lakshmi 2019-02-22 11:27:47 UTC
Michael, have you tried disabling the iommu?
Comment 3 Lakshmi 2019-03-07 13:17:42 UTC
Michael, any updates here?
Comment 4 Michael Olbrich 2019-03-07 15:14:13 UTC
Sorry for the delay. Disabling the iommu was not that simple because it was needed elsewhere in the system, so it took more work than just disabling it via the kernel command line.

Anyway, I still get the GPU hangs with the iommu disabled.
Comment 5 Michael Olbrich 2019-03-07 15:15:14 UTC
Created attachment 143571
/sys/class/drm/card0/error with iommu disabled

This is the error dump for 4.20.x. Anything else I should test?
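
For repeated test runs, each error dump can be saved under a timestamped name so that dumps from different kernels or settings are not mixed up. This is a minimal Python sketch under assumptions: it reads the /sys/class/drm/card0/error path used in this report (usually root-only), and the function name and output file naming are hypothetical.

```python
import time
from pathlib import Path

# Path used in this report; reading it typically requires root.
ERROR_PATH = Path("/sys/class/drm/card0/error")


def save_error_state(src: Path = ERROR_PATH, dest_dir: Path = Path(".")) -> Path:
    """Copy the error state to a timestamped file and return the new path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = dest_dir / f"i915-error-{stamp}.txt"
    dest.write_bytes(src.read_bytes())
    return dest


if __name__ == "__main__":
    print("saved", save_error_state())
```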
Comment 6 Lakshmi 2019-03-27 11:53:11 UTC
Please create a libva issue here

Closing this issue as NOTOURBUG.
