Bug 109629 - GPU hang on Haswell while encoding and decoding video
Status: CLOSED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-14 11:05 UTC by Michael Olbrich
Modified: 2019-03-27 11:53 UTC

See Also:
i915 platform: HSW
i915 features: GPU hang


Attachments
/sys/class/drm/card0/error for drm-intel-fixes-2019-02-13 (53.64 KB, text/plain)
2019-02-14 11:05 UTC, Michael Olbrich
/sys/class/drm/card0/error with iommu disabled (42.79 KB, text/plain)
2019-03-07 15:15 UTC, Michael Olbrich

Description Michael Olbrich 2019-02-14 11:05:05 UTC
Created attachment 143378
/sys/class/drm/card0/error for drm-intel-fixes-2019-02-13

I'm getting GPU hangs while encoding and decoding H.264 video with vaapi. I'm not sure what triggers the issue: it might be starting or stopping a stream, a resolution change, or corruption in the H.264 stream while decoding.

I've reproduced this with the following kernel versions:
- 4.20.x
- 5.0-rc6
- drm-intel-next-2019-02-07
- drm-intel-fixes-2019-02-13

It does not occur with 4.19.x and older.
Git bisect identifies 79556df293b2efbb3ccebb6db02120d62e348b44 as the first bad commit. On Haswell, this commit changes the default for ppgtt from 1 (aliasing) to 2 (full). If I set enable_ppgtt=1 on 4.20.x, the problem goes away.
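
For anyone trying to confirm which ppgtt mode is active, here is a minimal Python sketch. It assumes the i915 driver exposes the parameter under /sys/module/i915/parameters, which may not hold on every kernel version and may require root to read. The helper names read_i915_param and describe_ppgtt are hypothetical, and only the two values named in this report (1 and 2) are mapped.

```python
from pathlib import Path

# Mode names taken from this report: 1 is aliasing ppgtt, 2 is full ppgtt.
PPGTT_MODES = {1: "aliasing", 2: "full"}


def read_i915_param(name, base="/sys/module/i915/parameters"):
    """Return the raw value of an i915 module parameter, or None if unreadable.

    The sysfs location is an assumption; the parameter may be absent or
    readable only by root.
    """
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        return None


def describe_ppgtt(raw):
    """Map a raw enable_ppgtt value to the mode name used in this report."""
    try:
        return PPGTT_MODES.get(int(raw), "other")
    except (TypeError, ValueError):
        return "unknown"


if __name__ == "__main__":
    raw = read_i915_param("enable_ppgtt")
    print("enable_ppgtt =", raw, "->", describe_ppgtt(raw))
```

A result of "unknown" only means the value could not be read, not that ppgtt is disabled.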

I've attached the contents of /sys/class/drm/card0/error for drm-intel-fixes-2019-02-13.
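
The error state can be copied out of sysfs before a reboot discards it. The following is a rough Python sketch rather than an official tool. The function name save_error_state and the output file name are hypothetical, and reading the sysfs file usually requires root.

```python
from pathlib import Path


def save_error_state(src="/sys/class/drm/card0/error", dest="i915-error.txt"):
    """Copy the GPU error state to a regular file and return its length in characters.

    Undecodable bytes are replaced so the copy never fails on encoding.
    """
    text = Path(src).read_text(errors="replace")
    Path(dest).write_text(text)
    return len(text)


if __name__ == "__main__":
    try:
        print("saved", save_error_state(), "characters")
    except OSError as exc:
        # Typically a missing card0 node or insufficient permissions.
        print("could not read error state:", exc)
```

The card index (card0) matches this report; other systems may use a different index.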
Comment 1 Chris Wilson 2019-02-14 11:11:32 UTC
Looks like a userspace bug, though; it doesn't match the typical error expected for an invalid TLB (due to a bad mm switch). The batch was submitted by vaapi, so double-check that you've pulled the latest libva, and raise a bug with them (they have historically forgotten to set up their SBA correctly, and have had similar use-after-free bugs in their cmdbuffers...)

One other thing to double-check on the kernel side is disabling the iommu. Although the error state doesn't indicate that to be a problem, it is useful to rule it out as a source of memory latency, incoherency, or a missed flush/invalidate.
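
To see which IOMMU options a kernel was booted with, one option is to inspect the kernel command line. This Python sketch assumes the options were passed as intel_iommu=... or iommu=... boot parameters. It reports only what is on the command line, not whether the IOMMU is actually active. The function name iommu_options is hypothetical.

```python
def iommu_options(cmdline):
    """Return the IOMMU-related tokens found in a kernel command line string."""
    prefixes = ("intel_iommu=", "iommu=")
    return [tok for tok in cmdline.split() if tok.startswith(prefixes)]


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(iommu_options(f.read()))
    except OSError as exc:
        print("could not read /proc/cmdline:", exc)
```

An empty list means no such option was given at boot, so the kernel's built-in default applies.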
Comment 2 Lakshmi 2019-02-22 11:27:47 UTC
Michael, have you tried disabling the iommu?
Comment 3 Lakshmi 2019-03-07 13:17:42 UTC
Michael, any updates here?
Comment 4 Michael Olbrich 2019-03-07 15:14:13 UTC
Sorry for the delay. Disabling the iommu was not that simple because it was needed elsewhere in the system, so it took more work than just disabling it on the kernel command line.

Anyway, I still get the GPU hangs with the iommu disabled.
Comment 5 Michael Olbrich 2019-03-07 15:15:14 UTC
Created attachment 143571
/sys/class/drm/card0/error with iommu disabled

This is the error dump for 4.20.x. Anything else I should test?
Comment 6 Lakshmi 2019-03-27 11:53:11 UTC
Please create a libva issue here:
https://github.com/intel/libva/issues

Closing this issue as NOTOURBUG.

