85327 – [HSW] GPU hang on HSW Celeron when doing 16 VA-API decodes and compositing

Status:	CLOSED INVALID

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Simon Farnsworth
QA Contact:	Intel GFX Bugs mailing list

Reported:	2014-10-22 11:30 UTC by Simon Farnsworth
Modified:	2017-07-24 22:50 UTC
CC List:	2 users

i915 platform:	HSW
i915 features:	GPU hang

Attachments
Error state collected during hang (1.48 MB, application/octet-stream) 2014-10-22 11:31 UTC, Simon Farnsworth
Error state after patch from comment #3 is applied (515.67 KB, application/octet-stream) 2014-10-22 14:20 UTC, Simon Farnsworth

Description Simon Farnsworth 2014-10-22 11:30:54 UTC

I'm using the kernel described in bug 83677 - basically #requests backported to 3.17.0. This may turn out to be a duplicate of that bug.

While it's now harder to get a GPU hang, I've had one with lots of VA-API video decodes. dmesg gives:

[ 2327.073906] [drm] stuck on render ring
[ 2327.082270] [drm] GPU HANG: ecode 0:0x8fdffff8, in screen_manager [903], reason: Ring hung, action: reset
[ 2327.082273] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2327.082274] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2327.082275] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2327.082276] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2327.082277] [drm] GPU crash dump saved to /sys/class/drm/card0/error

I'll attach the gzip'd error state.

Comment 1 Simon Farnsworth 2014-10-22 11:31:27 UTC

Created attachment 108231
Error state collected during hang

Comment 2 Chris Wilson 2014-10-22 12:21:08 UTC

Whilst this doesn't seem to be ppgtt-related at first glance, you will want to boot that kernel with i915.enable_ppgtt=1 to prevent an eventual hang.

It also looks to be a different bug than bug 83677.

Comment 3 Chris Wilson 2014-10-22 12:30:00 UTC

On second thoughts, although the symptom is slightly different, it is still dying inside a context restore, just on a different command: MEDIA_VFE_STATE instead of 3DSTATE_VF.

Try:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915
index 670edfd..4f4de1c 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -507,6 +507,10 @@ mi_set_context(struct i915_gem_request *rq,
        if (IS_GEN6(rq->i915))
                rq->pending_flush |= I915_INVALIDATE_CACHES;
 
+       ret = i915_request_emit_flush(rq, I915_COMMAND_BARRIER);
+       if (ret)
+               return ret;
+
        len = 3;
        switch (INTEL_INFO(rq->i915)->gen) {
        case 8:

Comment 4 Simon Farnsworth 2014-10-22 13:46:47 UTC

The patch I'm testing is:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 841056c..0386721 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -434,6 +434,7 @@ mi_set_context(struct i915_gem_request *rq,
 {
        struct intel_ringbuffer *ring;
        int len;
+       int ret;
 
        /* w/a: If Flush TLB Invalidation Mode is enabled, driver must do a TLB
         * invalidation prior to MI_SET_CONTEXT. On GEN6 we don't set the value
@@ -443,6 +444,10 @@ mi_set_context(struct i915_gem_request *rq,
        if (IS_GEN6(rq->i915))
                rq->pending_flush |= I915_INVALIDATE_CACHES;
 
+       ret = i915_request_emit_flush(rq, I915_COMMAND_BARRIER);
+       if (ret)
+               return ret;
+
        len = 3;
        switch (INTEL_INFO(rq->i915)->gen) {
        case 8:

I've also set i915.enable_ppgtt=1 on the kernel command line. I'll let you know what I find.

Comment 5 Simon Farnsworth 2014-10-22 14:20:47 UTC

Created attachment 108234
Error state after patch from comment #3 is applied

No luck with that patch - it appears to simply move the deckchairs around again. New error state attached.

I note from the OSRC PRMs that your patch, strictly speaking, asks the GPU to do something that is documented as unsupported: it sets DW1 bit 20 (CS stall) without setting any of the five fields the PRMs say must accompany it (at least one of DW1 bits 12, 0, 1, 13 or 15:14 must be set).
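
To make that constraint concrete, here is a minimal standalone sketch of the check as I read it. The macro and function names are hypothetical (they are not the i915 driver's own definitions), and the bit positions are taken only from the paragraph above, so they should be verified against the PRM before anyone relies on this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; bit positions as quoted in this comment. */
#define PC_CS_STALL        (UINT32_C(1) << 20)   /* DW1 bit 20 */
#define PC_COMPANION_MASK  ((UINT32_C(1) << 12) | /* DW1 bit 12 */ \
                            (UINT32_C(1) << 0)  | /* DW1 bit 0 */  \
                            (UINT32_C(1) << 1)  | /* DW1 bit 1 */  \
                            (UINT32_C(1) << 13) | /* DW1 bit 13 */ \
                            (UINT32_C(3) << 14))  /* DW1 bits 15:14 */

/*
 * Returns true if DW1 satisfies the rule described above: when CS stall
 * is set, at least one of the companion fields must be non-zero.
 * A DW1 without CS stall is not restricted by this particular rule.
 */
static bool pipe_control_dw1_ok(uint32_t dw1)
{
        if (!(dw1 & PC_CS_STALL))
                return true;

        return (dw1 & PC_COMPANION_MASK) != 0;
}
```

With this check, a DW1 that has only bit 20 set is rejected, while bit 20 combined with, say, bit 1 or a non-zero value in bits 15:14 is accepted.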

Comment 6 Chris Wilson 2015-01-06 10:54:43 UTC

Hi Simon, any chance you can update the status of this bug? Maybe it was just a side-effect of the ctx restore bug after all! I can wish.

At any rate, we have Ben's patch to try and Ben has been tackling further HSW GT1 issues in mesa.

Comment 7 Simon Farnsworth 2015-01-06 10:56:02 UTC

I need to free up some time to investigate this again, with Chris's context switch fix in place.

Comment 8 Jani Nikula 2016-04-21 12:09:34 UTC

(In reply to Simon Farnsworth from comment #7)
> I need to free up some time to investigate this again, with Chris's context
> switch fix in place.

After a little over a year, have you had time to check this, or should we just close the bug...?

Comment 9 Simon Farnsworth 2016-04-21 13:17:39 UTC

(In reply to Jani Nikula from comment #8)
> (In reply to Simon Farnsworth from comment #7)
> > I need to free up some time to investigate this again, with Chris's context
> > switch fix in place.
> 
> After a little over a year, have you had time to check this, or should we
> just close the bug...?

I didn't have the chance to investigate again before I left ONELAN and thus lost access to the hardware. I've marked the bug as INVALID, since I can't help further.

Comment 10 Jani Nikula 2016-04-22 07:20:43 UTC

(In reply to Simon Farnsworth from comment #9)
> I didn't have the chance to investigate again before I left ONELAN and thus
> lost access to the hardware. I've marked the bug as INVALID, since I can't
> help further.

Thanks for the follow-up, Simon!
