Bug 112377

Summary: [TGL] media VME encoding GPU hang w/o i915 error state captured
Product: DRI
Reporter: Dmitry Rogozhkin <dmitry.v.rogozhkin>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED MOVED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major
Priority: highest
CC: chris, eero.t.tamminen, francesco.balestrieri, intel-gfx-bugs, james.ausmus, mika.kuoppala, tony.ye, tvrtko.ursulin
Version: XOrg git
Hardware: Other
OS: All
Whiteboard: Triaged, ReadyForDev
i915 platform: TGL
i915 features: GPU hang
Attachments:
Description                  Flags
Full dmesg log               none
drm-tip bisect               none
i915_error_state_904ce198    none
dmesg_904ce198               none

Description Dmitry Rogozhkin 2019-11-22 23:31:23 UTC
Created attachment 146016 [details]
Full dmesg log

From https://github.com/intel/media-driver/issues/773

Run:
wget https://fate-suite.libav.org/h264-conformance/AUD_MW_E.264
sample_multi_transcode -i::h264 AUD_MW_E.264 -hw -async 1 -u 4 -o::h264 a.h264

Stack:
* https://github.com/intel/gmmlib/commit/f78be970a6c3aef6d0347159f9c3f250421af16c
* https://github.com/intel/media-driver/commit/1645d0f06597599393625168cc4445b0ae092219
* https://github.com/Intel-Media-SDK/MediaSDK/commit/2515d8fbb65979685ce086aa5c9b24786cc2cab6 + apply https://github.com/Intel-Media-SDK/MediaSDK/pull/1771

Ran with latest drm-tip kernel:
commit 5bbbc0061acc528705e593d7e01c4c9c40b208db
Merge: e67c139 883d955
Author: Lyude Paul <lyude@redhat.com>
Date:   Fri Nov 22 14:12:54 2019 -0500

    Merge remote-tracking branch 'drm-intel/topic/core-for-CI' into drm-tip


Essential part of dmesg (full one attached with drm.debug=0x1e):
[   77.006161] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[   77.006204] i915 0000:00:02.0: sample_multi_tr[1951] context reset due to GPU hang
[   77.006293] [drm:__i915_request_reset [i915]] client sample_multi_tr[1951]: gained 1 ban score, now 1

i915 error state: EMPTY
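
For anyone triaging similar logs, here is a minimal sketch that pulls the reset events out of a dmesg dump. The two patterns are copied from the excerpt above and other kernel versions may word these messages differently; the function name `summarize_resets` is made up for illustration.

```python
import re

# Patterns taken from the dmesg excerpt in this report (assumption: other
# kernel versions may phrase these messages differently).
RESET_RE = re.compile(r"Resetting (\w+) for preemption time out")
HANG_RE = re.compile(r"(\S+)\[(\d+)\] context reset due to GPU hang")


def summarize_resets(dmesg_text):
    """Return (engines, clients) found in a dmesg dump.

    engines: engine names reset because of a preemption timeout.
    clients: (process name, pid) pairs whose context was reset.
    """
    engines = RESET_RE.findall(dmesg_text)
    clients = [(name, int(pid)) for name, pid in HANG_RE.findall(dmesg_text)]
    return engines, clients


sample = """\
[   77.006161] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[   77.006204] i915 0000:00:02.0: sample_multi_tr[1951] context reset due to GPU hang
"""

engines, clients = summarize_resets(sample)
print(engines, clients)
```

A reset that shows up here without a matching error state is the situation described in this report.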


Can you please help debug the issue from the KMD standpoint?

Why is the i915 error state missing? Is this a real GPU hang or an i915 bug?

Note: the media encoder above runs on both the RCS and VCS rings; the VCS tasks depend on the RCS ones.
Comment 1 Chris Wilson 2019-11-22 23:43:02 UTC
It's a forced preemption; reset from softirq context prohibits error capture and would further compromise the QoS of the more important task.
Comment 2 Dmitry Rogozhkin 2019-11-22 23:55:30 UTC
Would you mind explaining what a forced preemption is, and suggesting why it happened here?

Is it possible to disable it so that the error state can still be captured? Is it possible to force error state capture?
Comment 3 Chris Wilson 2019-11-23 10:21:18 UTC
Forced preemption occurs when a lower priority batch does not yield the GPU to a higher preemption request [which is a denial-of-service from one and a quality-of-service guarantee for the other], over a certain timeout. See DRM_I915_PREEMPT_TIMEOUT. The plan is to have that property on the engine, but that is waiting for you to ack, so currently the override is i915.reset=2.
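
The decision described above can be sketched as a toy model. This is not the i915 implementation: the `Batch` type, the `arbitrate` function, and the priority and timing values are all hypothetical, and treating a timeout of 0 as "disabled" is an assumption made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    name: str
    priority: int
    # Time the batch needs before it reaches a point where it can yield.
    yields_after_ms: float


def arbitrate(running, waiting, preempt_timeout_ms):
    """Toy model of what happens when `waiting` wants the engine.

    Returns one of "no preemption", "preempted", "forced reset".
    A timeout of 0 is treated as "timeout disabled" (assumption).
    """
    if waiting.priority <= running.priority:
        return "no preemption"
    if preempt_timeout_ms > 0 and running.yields_after_ms > preempt_timeout_ms:
        # The running batch did not yield in time, so its context is reset.
        return "forced reset"
    return "preempted"


heartbeat = Batch("heartbeat", priority=10, yields_after_ms=0)
stuck = Batch("encode", priority=0, yields_after_ms=1000)

print(arbitrate(stuck, heartbeat, preempt_timeout_ms=100))
```

In this model a higher-priority request, even a housekeeping one, is enough to trigger the reset if the running batch never yields.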
Comment 4 Dmitry Rogozhkin 2019-11-23 15:59:57 UTC
>> Forced preemption occurs when a lower priority batch

Thank you for the clarification. However, this does not make sense for my use case: I believe this was the only workload running on the system. I even tried with i915.disable_display=1, with the same result.

Chris, could you please suggest whether an issue in i915 could lead to this behavior?
Comment 5 Eero Tamminen 2019-11-25 12:02:05 UTC
(In reply to Chris Wilson from comment #3)
> Forced preemption occurs when a lower priority batch does not yield the GPU
> to a higher preemption request [which is a denial-of-service from one and a
> quality-of-service guarantee for the other], over a certain timeout.

What is higher pre-emption request in this case?

(Maybe dmesg output could identify also the higher priority process?)


(In reply to Chris Wilson from comment #1)
> It's a forced preemption; reset from softirq context prohibits error capture
> and would further compromise the QoS of the more important task.

Is there some option to enable error capture also in that case?

(Otherwise debugging those issues would become really hard...)
Comment 6 Chris Wilson 2019-11-25 15:33:37 UTC
(In reply to Eero Tamminen from comment #5)
> (In reply to Chris Wilson from comment #3)
> > Forced preemption occurs when a lower priority batch does not yield the GPU
> > to a higher preemption request [which is a denial-of-service from one and a
> > quality-of-service guarantee for the other], over a certain timeout.
> 
> What is higher pre-emption request in this case?

priority.

> (Maybe dmesg output could identify also the higher priority process?)

It's not the one at fault, it can be anything including the heartbeat.

 
> (In reply to Chris Wilson from comment #1)
> > It's a forced preemption; reset from softirq context prohibits error capture
> > and would further compromise the QoS of the more important task.
> 
> Is there some option to enable error capture also in that case?

The option is to turn off this mode. The means to do that are likely to change when we add more knobs other than modparams.

> (Otherwise debugging those issues would become really hard...)
Comment 7 Dmitry Rogozhkin 2019-11-25 15:42:34 UTC
Is anything else needed from the submission side to debug the issue on the i915 side?
Comment 8 Dmitry Rogozhkin 2019-11-26 01:53:42 UTC
The issue can be reproduced on both the CI_DRM_7350 and CI_DRM_7420 kernel builds.
Comment 9 Tvrtko Ursulin 2019-11-26 09:30:26 UTC
Are media batches still non-preemptible on TGL? But do you expect a single one to run for more than 100ms?

Otherwise, I think https://patchwork.freedesktop.org/series/69992/ is trying to relax the timeout, so perhaps give it a spin.
Comment 10 Dmitry Rogozhkin 2019-11-26 15:48:24 UTC
The command line I provided uses the smallest possible video, so batches should execute almost instantaneously, in much less than 100ms. Actually, I would like to point out 2 issues here:
1. Media fails on the key use case
2. The i915 driver hides the GPU hang and does not provide a way to get the error dump

The first one is critical for the TGL program.

The second one is critical in the longer term: how are we supposed to debug hangs? I strongly suggest either changing the behavior to allow dumps, or introducing a special i915 module parameter that allows dumps again. In the case of a parameter, I would suggest printing a message to the dmesg log telling the user to rerun the workload with this parameter and capture the dump.
Comment 11 Tvrtko Ursulin 2019-11-26 16:06:51 UTC
Can you please try with i915.reset=2?
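
To confirm the parameter was actually picked up at boot, a small sketch like the following can parse the kernel command line. The helper name `module_param` is hypothetical, and it only covers parameters passed on the command line (not ones set through modprobe configuration).

```python
def module_param(cmdline, module, name):
    """Return the last value given for `module.name=` on a kernel
    command line, or None if the parameter is not present."""
    prefix = f"{module}.{name}="
    value = None
    for token in cmdline.split():
        if token.startswith(prefix):
            value = token[len(prefix):]  # later occurrences override earlier ones
    return value


# Example command line (illustrative values only).
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 i915.reset=2 drm.debug=0x1e"
print(module_param(example, "i915", "reset"))
```

On a live system the input would come from reading /proc/cmdline.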
Comment 12 Dmitry Rogozhkin 2019-11-27 08:57:19 UTC
This appears to be either a regression in drm-tip or a change that should also be reflected in user space. Git bisect points to the following commit; reverting it makes the use case pass. @Mika: can you please comment on why the change could break media usage?

commit 08fff7aeddc9dd72161b4c8fc27fbab12b4b9352
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Tue Oct 15 18:44:49 2019 +0300

    drm/i915/tgl: Wa_1607138340

    Avoid possible cs hang with semaphores by disabling
    lite restore.

    Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191015154449.10338-11-mika.kuoppala@linux.intel.com



I am attaching:
1. Git bisect log
2. i915 error state dumped from kernel @904ce198 (this kernel does not have the preemption timeout)
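
For readers unfamiliar with bisection, the search behind the result above can be sketched as a binary search over a linear list of commits. This is a simplification: real git history is a graph and git bisect handles merges, and the names `first_bad` and `is_bad` are made up for this sketch.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit by binary search.

    commits: ordered oldest to newest; commits[0] is known good and
             commits[-1] is known bad.
    is_bad:  callable that tests one commit (e.g. runs the reproducer).
    """
    lo, hi = 0, len(commits) - 1  # lo is always good, hi is always bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Toy history: commits "a".."h", where "e" introduced the regression.
history = list("abcdefgh")
print(first_bad(history, lambda c: c >= "e"))
```

Each step halves the remaining range, so even a long history needs only a handful of reproducer runs.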
Comment 13 Dmitry Rogozhkin 2019-11-27 08:58:05 UTC
Created attachment 146030 [details]
drm-tip bisect
Comment 14 Dmitry Rogozhkin 2019-11-27 08:58:33 UTC
Created attachment 146031 [details]
i915_error_state_904ce198
Comment 15 Dmitry Rogozhkin 2019-11-27 08:59:22 UTC
Created attachment 146032 [details]
dmesg_904ce198
Comment 16 Lakshmi 2019-11-28 09:11:19 UTC
Setting the priority of this bug to highest considering it's a regression.
Comment 18 Martin Peres 2019-11-29 19:52:08 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/643.
