I'm using the kernel described in bug 83677 - basically #requests backported to 3.17.0. This may turn out to be a duplicate of that bug.

While it's now harder to get a GPU hang, I've had one with lots of VA-API video decodes. dmesg gives:

[ 2327.073906] [drm] stuck on render ring
[ 2327.082270] [drm] GPU HANG: ecode 0:0x8fdffff8, in screen_manager [903], reason: Ring hung, action: reset
[ 2327.082273] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2327.082274] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2327.082275] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2327.082276] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2327.082277] [drm] GPU crash dump saved to /sys/class/drm/card0/error

I'll attach the gzip'd error state.
Created attachment 108231 [details] Error state collected during hang
Whilst this doesn't seem to be ppgtt related at first glance, you want to use i915.enable_ppgtt=1 with that kernel to prevent an eventual hang. It also looks to be a different bug than bug 83677.
On second thoughts, the symptom is only slightly different: it is still dying inside a context restore, just on a different command (MEDIA_VFE_STATE instead of 3DSTATE_VF). Try:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915
index 670edfd..4f4de1c 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -507,6 +507,10 @@ mi_set_context(struct i915_gem_request *rq,
 	if (IS_GEN6(rq->i915))
 		rq->pending_flush |= I915_INVALIDATE_CACHES;
 
+	ret = i915_request_emit_flush(rq, I915_COMMAND_BARRIER);
+	if (ret)
+		return ret;
+
 	len = 3;
 	switch (INTEL_INFO(rq->i915)->gen) {
 	case 8:
The patch I'm testing is:

diff --git a/drivers/gpu/drm/i915/i915_gem_context.c b/drivers/gpu/drm/i915/i915_gem_context.c
index 841056c..0386721 100644
--- a/drivers/gpu/drm/i915/i915_gem_context.c
+++ b/drivers/gpu/drm/i915/i915_gem_context.c
@@ -434,6 +434,7 @@ mi_set_context(struct i915_gem_request *rq,
 {
 	struct intel_ringbuffer *ring;
 	int len;
+	int ret;
 
 	/* w/a: If Flush TLB Invalidation Mode is enabled, driver must do a TLB
 	 * invalidation prior to MI_SET_CONTEXT. On GEN6 we don't set the value
@@ -443,6 +444,10 @@ mi_set_context(struct i915_gem_request *rq,
 	if (IS_GEN6(rq->i915))
 		rq->pending_flush |= I915_INVALIDATE_CACHES;
 
+	ret = i915_request_emit_flush(rq, I915_COMMAND_BARRIER);
+	if (ret)
+		return ret;
+
 	len = 3;
 	switch (INTEL_INFO(rq->i915)->gen) {
 	case 8:

I've also set i915.enable_ppgtt=1 on the kernel command line. I'll let you know what I find.
Created attachment 108234 [details]
Error state after patch from comment #3 is applied

No luck with that patch - it appears to simply move the deckchairs around again. New error state attached.

I note from the OSRC PRMs that your patch strictly speaking asks the GPU to do something that's claimed as not supported - you set DW1 bit 20 (CS stall), but not one of the 5 bits the OSRC PRMs claim you must also set (at least one of DW1 bits 12, 0, 1, 13 or 15:14 must be set).
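
To make the DW1 constraint above concrete, here is a minimal C sketch of a validity check. It assumes only the bit positions quoted in this comment (bit 20 for CS stall, and bits 12, 0, 1, 13 and the field 15:14 as the required companions). The macro and function names are hypothetical and are not taken from the i915 driver or the PRMs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the bit positions come from the comment above. */
#define DW1_BIT(n)	(UINT32_C(1) << (n))
#define DW1_CS_STALL	DW1_BIT(20)
/* Bits 12, 0, 1, 13 and the two-bit field 15:14. */
#define DW1_CS_STALL_COMPANIONS \
	(DW1_BIT(12) | DW1_BIT(0) | DW1_BIT(1) | DW1_BIT(13) | (UINT32_C(3) << 14))

/* Returns true when dw1 obeys the rule: CS stall needs a companion bit. */
bool pipe_control_dw1_ok(uint32_t dw1)
{
	if (!(dw1 & DW1_CS_STALL))
		return true;	/* rule only applies when CS stall is requested */
	return (dw1 & DW1_CS_STALL_COMPANIONS) != 0;
}

/* Debug helper: trap an invalid combination before it would be emitted. */
uint32_t pipe_control_dw1_checked(uint32_t dw1)
{
	assert(pipe_control_dw1_ok(dw1));
	return dw1;
}
```

Under these assumptions, a DW1 with only bit 20 set fails the check, which is the combination the patch appears to produce.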
Hi Simon, any chance you can update the status of this bug? Maybe it was just a side-effect of the ctx restore bug after all! I can wish. At any rate, we have Ben's patch to try, and Ben has been tackling further HSW GT1 issues in mesa.
I need to free up some time to investigate this again, with Chris's context switch fix in place.
(In reply to Simon Farnsworth from comment #7)
> I need to free up some time to investigate this again, with Chris's context
> switch fix in place.

After a little over a year, have you had time to check this, or should we just close the bug...?
(In reply to Jani Nikula from comment #8)
> (In reply to Simon Farnsworth from comment #7)
> > I need to free up some time to investigate this again, with Chris's context
> > switch fix in place.
>
> After a little over a year, have you had time to check this, or should we
> just close the bug...?

I didn't have the chance to investigate again before I left ONELAN and thus lost access to the hardware. I've marked the bug as INVALID, since I can't help further.
(In reply to Simon Farnsworth from comment #9)
> I didn't have the chance to investigate again before I left ONELAN and thus
> lost access to the hardware. I've marked the bug as INVALID, since I can't
> help further.

Thanks for the follow-up, Simon!