Bug 85327

Summary: [HSW] GPU hang on HSW Celeron when doing 16 VA-API decodes and compositing
Product: DRI Reporter: Simon Farnsworth <simon>
Component: DRM/Intel Assignee: Simon Farnsworth <simon>
Status: CLOSED INVALID QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs, przanoni
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: HSW i915 features: GPU hang
Attachments:
Description: Error state collected during hang; Flags: none
Description: Error state after patch from comment #3 is applied; Flags: none

Description Simon Farnsworth 2014-10-22 11:30:54 UTC
I'm using the kernel described in bug 83677 - basically #requests backported to 3.17.0. This may turn out to be a duplicate of that bug.

While it's now harder to get a GPU hang, I've had one with lots of VA-API video decodes. dmesg gives:

[ 2327.073906] [drm] stuck on render ring
[ 2327.082270] [drm] GPU HANG: ecode 0:0x8fdffff8, in screen_manager [903], reason: Ring hung, action: reset
[ 2327.082273] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2327.082274] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2327.082275] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2327.082276] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2327.082277] [drm] GPU crash dump saved to /sys/class/drm/card0/error

I'll attach the gzip'd error state.
Comment 1 Simon Farnsworth 2014-10-22 11:31:27 UTC
Created attachment 108231 [details]
Error state collected during hang
Comment 2 Chris Wilson 2014-10-22 12:21:08 UTC
Whilst this doesn't seem to be ppgtt related at first glance, you want to use i915.enable_ppgtt=1 with that kernel to prevent an eventual hang.

It also looks to be a different bug than bug 83677.
Comment 3 Chris Wilson 2014-10-22 12:30:00 UTC
On second thoughts, the symptom is only slightly different: it is still dying inside a context restore, but on a different command (MEDIA_VFE_STATE instead of 3DSTATE_VF).

Try:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915
index 670edfd..4f4de1c 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -507,6 +507,10 @@ mi_set_context(struct i915_gem_request *rq,
        if (IS_GEN6(rq->i915))
                rq->pending_flush |= I915_INVALIDATE_CACHES;
 
+       ret = i915_request_emit_flush(rq, I915_COMMAND_BARRIER);
+       if (ret)
+               return ret;
+
        len = 3;
        switch (INTEL_INFO(rq->i915)->gen) {
        case 8:
Comment 4 Simon Farnsworth 2014-10-22 13:46:47 UTC
The patch I'm testing is:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 841056c..0386721 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -434,6 +434,7 @@ mi_set_context(struct i915_gem_request *rq,
 {
        struct intel_ringbuffer *ring;
        int len;
+       int ret;
 
        /* w/a: If Flush TLB Invalidation Mode is enabled, driver must do a TLB
         * invalidation prior to MI_SET_CONTEXT. On GEN6 we don't set the value
@@ -443,6 +444,10 @@ mi_set_context(struct i915_gem_request *rq,
        if (IS_GEN6(rq->i915))
                rq->pending_flush |= I915_INVALIDATE_CACHES;
 
+       ret = i915_request_emit_flush(rq, I915_COMMAND_BARRIER);
+       if (ret)
+               return ret;
+
        len = 3;
        switch (INTEL_INFO(rq->i915)->gen) {
        case 8:

I've also set i915.enable_ppgtt=1 on the kernel command line. I'll let you know what I find.
Comment 5 Simon Farnsworth 2014-10-22 14:20:47 UTC
Created attachment 108234 [details]
Error state after patch from comment #3 is applied

No luck with that patch - it appears to simply move the deckchairs around again. New error state attached.

I note from the OSRC PRMs that your patch, strictly speaking, asks the GPU to do something the PRMs claim is not supported: you set DW1 bit 20 (CS stall), but none of the bits the OSRC PRMs claim you must also set (at least one of DW1 bits 12, 0, 1, 13 or 15:14 must be set).
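
For illustration, the constraint described above can be sketched as a small standalone check on the DW1 value. This is a minimal sketch, not driver code: only the bit positions (20, 12, 0, 1, 13 and 15:14) come from this comment, while the macro and function names are hypothetical labels invented for the example and do not correspond to anything in i915.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions are taken from the comment. */
#define PC_DW1_CS_STALL (UINT32_C(1) << 20)

/* Bits 0, 1, 12, 13 and the two-bit field 15:14. */
#define PC_DW1_COMPANION_MASK            \
	((UINT32_C(1) << 0)  |           \
	 (UINT32_C(1) << 1)  |           \
	 (UINT32_C(1) << 12) |           \
	 (UINT32_C(1) << 13) |           \
	 (UINT32_C(3) << 14))

/*
 * Returns true if dw1 satisfies the rule quoted above: when CS stall
 * (bit 20) is set, at least one companion bit must also be set.
 * A value without CS stall is trivially acceptable under this rule.
 */
bool pc_dw1_cs_stall_ok(uint32_t dw1)
{
	if (!(dw1 & PC_DW1_CS_STALL))
		return true;
	return (dw1 & PC_DW1_COMPANION_MASK) != 0;
}

/*
 * Adds CS stall to dw1 together with a caller-chosen companion bit.
 * Asserts that the companion really is one of the permitted bits.
 */
uint32_t pc_dw1_add_cs_stall(uint32_t dw1, uint32_t companion)
{
	assert((companion & PC_DW1_COMPANION_MASK) != 0);
	return dw1 | PC_DW1_CS_STALL | (companion & PC_DW1_COMPANION_MASK);
}
```

Under this sketch, a DW1 with only bit 20 set fails the check, which is the situation described for the patch from comment #3; setting bit 20 together with, say, bit 1 passes.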
Comment 6 Chris Wilson 2015-01-06 10:54:43 UTC
Hi Simon, any chance you can update the status of this bug? Maybe it was just a side-effect of the ctx restore bug after all! I can wish.

At any rate, we have Ben's patch to try and Ben has been tackling further HSW GT1 issues in mesa.
Comment 7 Simon Farnsworth 2015-01-06 10:56:02 UTC
I need to free up some time to investigate this again, with Chris's context switch fix in place.
Comment 8 Jani Nikula 2016-04-21 12:09:34 UTC
(In reply to Simon Farnsworth from comment #7)
> I need to free up some time to investigate this again, with Chris's context
> switch fix in place.

After a little over a year, have you had time to check this, or should we just close the bug...?
Comment 9 Simon Farnsworth 2016-04-21 13:17:39 UTC
(In reply to Jani Nikula from comment #8)
> (In reply to Simon Farnsworth from comment #7)
> > I need to free up some time to investigate this again, with Chris's context
> > switch fix in place.
> 
> After a little over a year, have you had time to check this, or should we
> just close the bug...?

I didn't have the chance to investigate again before I left ONELAN and thus lost access to the hardware. I've marked the bug as INVALID, since I can't help further.
Comment 10 Jani Nikula 2016-04-22 07:20:43 UTC
(In reply to Simon Farnsworth from comment #9)
> I didn't have the chance to investigate again before I left ONELAN and thus
> lost access to the hardware. I've marked the bug as INVALID, since I can't
> help further.

Thanks for the follow-up, Simon!
