90953 – [IVB] GPU hang (possibly triggered by VA-API)

Summary: [IVB] GPU hang (possibly triggered by VA-API)

Status:	CLOSED WORKSFORME

Alias:	None

Product:	libva
Classification:	Unclassified
Component:	intel
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	ykzhao
QA Contact:	Sean V Kelley

Reported:	2015-06-12 09:11 UTC by Simon Farnsworth
Modified:	2016-11-03 21:08 UTC
CC List:	2 users

Attachments
Error state collected by the kernel (1003.25 KB, application/octet-stream) 2015-06-12 09:12 UTC, Simon Farnsworth
Error state with libva and intel-driver git master (659.88 KB, application/octet-stream) 2015-06-15 17:44 UTC, Simon Farnsworth

Description Simon Farnsworth 2015-06-12 09:11:18 UTC

Using Fedora 21 with kernel 3.18.9-100 on x86_64, I see a GPU hang after the system has been running for an extended period.

Userspace components in use are:

From Fedora:
xorg-x11-drv-intel-2.21.15-9.fc20.x86_64
mesa-dri-drivers-10.3.3-1.20141110.fc20.x86_64
libdrm-2.4.58-1.fc20.x86_64

Compiled by me:
libva-1.4.1
libva-intel-driver-1.4.2-0.git8e34fb34

Comment 1 Simon Farnsworth 2015-06-12 09:12:09 UTC

Created attachment 116454
Error state collected by the kernel

Error state collected by the kernel; the GPU is whatever's inside an Intel(R) Celeron(R) CPU 1037U @ 1.80GHz

Comment 2 Chris Wilson 2015-06-12 18:54:05 UTC

Yes, looks like libva jumps into a malformed batch. Over to libva for better analysis.

Comment 3 Sean V Kelley 2015-06-12 22:57:07 UTC

What are you actually executing with libva/intel-driver?

Also, retest with libva/intel-driver master.

Comment 4 Simon Farnsworth 2015-06-14 12:58:15 UTC

(In reply to Sean V Kelley from comment #3)
> What are you actually executing with libva/intel-driver?
> 
> Also, retest with libva/intel-driver master.

I'm using gstreamer1-vaapi to decode and deinterlace a 1920x1080 MPEG-2 MP@HL video. I can get the gstreamer1-vaapi version on Monday, together with any other data you need; we're negotiating with the customer to get permission to send you the video, too.

I'll retest with git master as well, and see if the same fault occurs; it's non-deterministic, and takes 24-48 hours to trigger.

Comment 5 Simon Farnsworth 2015-06-15 09:30:28 UTC

(In reply to Sean V Kelley from comment #3)
> What are you actually executing with libva/intel-driver?
> 
> Also, retest with libva/intel-driver master.

I can't easily test with just intel-driver

(In reply to Simon Farnsworth from comment #4)
> (In reply to Sean V Kelley from comment #3)
> > What are you actually executing with libva/intel-driver?
> > 
> > Also, retest with libva/intel-driver master.
> 
> I'm using gstreamer1-vaapi to decode and deinterlace a 1920x1080 MPEG-2
> MP@HL video. I can get the gstreamer1-vaapi version on Monday, together with
> any other data you need; we're negotiating with the customer to get
> permission to send you the video, too.
> 
> I'll retest with git master as well, and see if the same fault occurs; it's
> non-deterministic, and takes 24-48 hours to trigger.

GStreamer is at 1.4.3; gstreamer-vaapi is at 0.5.10

Comment 6 Simon Farnsworth 2015-06-15 17:44:03 UTC

(In reply to Sean V Kelley from comment #3)
> What are you actually executing with libva/intel-driver?
> 
> Also, retest with libva/intel-driver master.

I've brought libva to:

commit 5d07b29687db6d17811b7ecf9b779377e9851a27
Author: Xiang, Haihao <haihao.xiang@intel.com>
Date:   Wed Jun 10 14:41:14 2015 +0800

    test/decode/tinyjpeg: make sure the pointer is valid before dereferencing it
    
    Signed-off-by: Xiang, Haihao <haihao.xiang@intel.com>
    Reviewed-by: Sean V Kelley <seanvk@posteo.de>
    (cherry picked from commit 8455834161bab3374fe9756fd4a28d919027daf7)


and intel-driver to:

commit e797089446c1f5b71b239b9046d76e054dfcba59
Author: Zhong Li <zhong.li@intel.com>
Date:   Mon Jun 8 12:42:21 2015 +0800

    VP8 HWEnc: Modify qp threshold value for mode cost calculatation
    
    The patch is helpful to improve quality when qp is lower than the
    threshold value.
    
    Signed-off-by: Zhong Li <zhong.li@intel.com>

I've seen a hang - I'll attach the error state in case it's different.

I'm using vaapidecode ! queue ! vaapipostproc ! vaapisink with suitable parameters set (MCDI if possible) to handle video. The queue is set to take up to 6 frames of video from vaapidecode.

Comment 7 Simon Farnsworth 2015-06-15 17:44:35 UTC

Created attachment 116522
Error state with libva and intel-driver git master

Comment 8 haihao 2015-06-24 08:08:58 UTC

Could you provide the full command line and the sample video if possible ?

Comment 9 Simon Farnsworth 2015-06-24 10:45:16 UTC

(In reply to haihao from comment #8)
> Could you provide the full command line and the sample video if possible ?

It reproduces erratically with:

gst-launch-1.0 filesrc location=01_Work4U_Finding.mpeg ! \
    decodebin name=db max-size-bytes=$((16 * 1024 * 1024)) expose-all-streams=false ! \
    queue max-size-buffers=0 max-size-bytes=0 max-size-time=$((200 * 1000 * 1000)) ! \
    audioconvert dithering=tpdf-hf ! \
    audioresample quality=10 sinc-filter-mode=full ! \
    alsasink qos=true max-lateness=$((20 * 1000 * 1000)) \
    db. ! queue max-size-buffers=6 max-size-bytes=0 max-size-time=$((200 * 1000 * 1000)) ! \
    vaapipostproc force-aspect-ratio=false deinterlace-method=motion-compensated deinterlace-mode=auto ! \
    vaapisink force-aspect-ratio=false show-preroll-frame=false max-lateness=$((20 * 1000 * 1000))

If I change deinterlace-mode on vaapipostproc to deinterlace-mode=disabled, the error disappears. Similarly, if I re-encode as progressive instead of interlaced, the error disappears.

I'm getting customer permission to send you 01_Work4U_Finding.mpeg - I may have to send a link to your intel.com e-mail privately, depending on the customer's attitude.

Comment 10 Simon Farnsworth 2015-07-02 09:08:32 UTC

(In reply to haihao from comment #8)
> Could you provide the full command line and the sample video if possible ?

Sample video link sent by e-mail; my customer does not want it shared publicly.

Comment 11 Simon Farnsworth 2015-08-27 12:25:20 UTC

(In reply to haihao from comment #8)
> Could you provide the full command line and the sample video if possible ?

Have you got anywhere investigating this?

I'm stalled at my end - the only thing I can envisage given what I've found so far is some form of erratum with large batch buffers.

Comment 12 haihao 2015-09-07 05:09:06 UTC

(In reply to Simon Farnsworth from comment #11)
> (In reply to haihao from comment #8)
> > Could you provide the full command line and the sample video if possible ?
> 
> Have you got anywhere investigating this?

No. I can't reproduce this issue on my machine. I believe I already replied to you by email.

> 
> I'm stalled at my end - the only thing I can envisage given what I've found
> so far is some form of erratum with large batch buffers.

Comment 13 Sean V Kelley 2015-09-10 17:16:05 UTC

As Haihao mentioned, we are unable to reproduce your specific issue, which appears to be specific to your configuration.  We see no such GPU hangs on decode and deinterlace of a 1920x1080 MPEG-2 MP@HL video for IVB.

Sean

Comment 14 Simon Farnsworth 2015-09-10 21:49:16 UTC

(In reply to Sean V Kelley from comment #13)
> As Haihao mentioned, we are unable to reproduce your specific issue, which
> appears to be specific to your configuration.  We see no such GPU hangs on
> decode and deinterlace of a 1920x1080 MPEG-2 MP@HL video for IVB.
> 
> Sean

And as I've mentioned to Haihao, I'm happy to send a complete system with a driver build environment that reproduces the problem to an address of Intel's choosing.

Comment 15 Simon Farnsworth 2015-10-12 13:10:28 UTC

I've had the chance to do a bit more work on this.

It looks like VA-API doesn't give the kernel the right hints for tracking GEM object dirty state; in the kernel, if I change i915_gem_execbuffer_move_to_active to unconditionally set obj->dirty=1 (instead of only setting it if obj->base.write_domain is true), the hang goes away:

--- a/drivers/gpu/drm/i915/i915_gem_execbuffer.c
+++ b/drivers/gpu/drm/i915/i915_gem_execbuffer.c
@@ -1032,6 +1032,7 @@ i915_gem_execbuffer_move_to_active(struct list_head *vmas,
                u32 old_read = obj->base.read_domains;
                u32 old_write = obj->base.write_domain;
 
+               obj->dirty = 1;
                obj->base.write_domain = obj->base.pending_write_domain;
                if (obj->base.write_domain == 0)
                        obj->base.pending_read_domains |= obj->base.read_domains;
@@ -1039,7 +1040,6 @@ i915_gem_execbuffer_move_to_active(struct list_head *vmas,
 
                i915_vma_move_to_active(vma, req);
                if (obj->base.write_domain) {
-                       obj->dirty = 1;
                        i915_gem_request_assign(&obj->last_write_req, req);
 
                        intel_fb_obj_invalidate(obj, ORIGIN_CS);

I assume that this means that you're not calling the SET_DOMAIN ioctl() at appropriate points, but relying on the kernel doing the right thing for you anyway.

Comment 16 haihao 2015-10-30 04:46:30 UTC

Yes, the SET_DOMAIN ioctl() is called explicitly in the driver, The driver calls drm_intel_gem_bo_map()/drm_intel_gem_bo_map_gtt()/drm_intel_bo_emit_reloc() to change DOMAIN setting.

Comment 17 haihao 2015-10-30 04:48:00 UTC

(In reply to haihao from comment #16)
> Yes, the SET_DOMAIN ioctl() is called explicitly in the driver, The driver
> calls
> drm_intel_gem_bo_map()/drm_intel_gem_bo_map_gtt()/drm_intel_bo_emit_reloc()
> to change DOMAIN setting.

Sorry, that should read:

The SET_DOMAIN ioctl() *isn't* called explicitly in the driver. The driver
calls drm_intel_gem_bo_map()/drm_intel_gem_bo_map_gtt()/drm_intel_bo_emit_reloc()
to change the domain setting.
