Created attachment 146016
Full dmesg log
sample_multi_transcode -i::h264 AUD_MW_E.264 -hw -async 1 -u 4 -o::h264 a.h264
* https://github.com/Intel-Media-SDK/MediaSDK/commit/2515d8fbb65979685ce086aa5c9b24786cc2cab6 + apply https://github.com/Intel-Media-SDK/MediaSDK/pull/1771
Ran with latest drm-tip kernel:
Merge: e67c139 883d955
Author: Lyude Paul <firstname.lastname@example.org>
Date: Fri Nov 22 14:12:54 2019 -0500
Merge remote-tracking branch 'drm-intel/topic/core-for-CI' into drm-tip
Essential part of dmesg (full one attached with drm.debug=0x1e):
[ 77.006161] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 77.006204] i915 0000:00:02.0: sample_multi_tr context reset due to GPU hang
[ 77.006293] [drm:__i915_request_reset [i915]] client sample_multi_tr: gained 1 ban score, now 1
i915 error state: EMPTY
Can you please help debug the issue from the KMD standpoint?
Why is the i915 error state missing? Is it a real GPU hang or an i915 bug?
Note: the media encoder above runs on the RCS and VCS rings; VCS tasks depend on RCS ones.
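For triage, the reset lines in the excerpt above can be picked out of a saved dmesg log mechanically. A minimal Python sketch, assuming the message wording shown in this report (it may differ between kernel versions); the function name `find_engine_resets` is made up for illustration:

```python
import re

# Pattern modeled on the dmesg line quoted in this report; the exact wording
# is an assumption and may change between kernel versions.
RESET_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i915\s+(?P<dev>\S+):\s+"
    r"Resetting\s+(?P<engine>\S+)\s+for\s+(?P<reason>.+)$"
)


def find_engine_resets(dmesg_text):
    """Return (timestamp, device, engine, reason) for each engine-reset line."""
    resets = []
    for line in dmesg_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            resets.append((
                float(match.group("ts")),
                match.group("dev"),
                match.group("engine"),
                match.group("reason").strip(),
            ))
    return resets
```

Run against the attached log, this would list which engine was reset and why, which is the only information available while the error state is empty.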
It's a forced preemption; reset from softirq context prohibits error capture and would further compromise the QoS of the more important task.
Would you mind explaining what a forced preemption is and suggesting why it happened here?
Is it possible to disable it so that the error state can still be captured? Is it possible to force capture of the error state?
Forced preemption occurs when a lower priority batch does not yield the GPU to a higher preemption request [which is a denial-of-service from one and a quality-of-service guarantee for the other], over a certain timeout. See DRM_I915_PREEMPT_TIMEOUT. The plan is to have that property on the engine, but that is waiting for you to ack, so currently the override is i915.reset=2.
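The rule described above can be illustrated with a toy model. This is only a sketch of the decision as explained in this comment, not the i915 implementation; the `Batch` type, the `resolve_preemption` function, and the numeric values in the usage note are all hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Batch:
    name: str
    priority: int
    yields_after_ms: float  # time until the batch reaches a preemption point


def resolve_preemption(running, pending, preempt_timeout_ms):
    """Decide what happens when `pending` wants the engine held by `running`.

    Toy model only; it does not mirror the i915 code.
    """
    if pending.priority <= running.priority:
        return "no preemption"  # nothing more important is waiting
    if running.yields_after_ms <= preempt_timeout_ms:
        return "preempted"      # the running batch yielded in time
    return "engine reset"       # forced preemption: the timeout expired
```

For example, `resolve_preemption(Batch("encode", 0, 500.0), Batch("heartbeat", 1, 0.0), 100.0)` models a batch that does not yield within an illustrative 100 ms timeout and is therefore reset.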
>> Forced preemption occurs when a lower priority batch
Thank you for the clarification. However, this does not make sense for my use case: I believe this was the only workload running on the system. I even tried with i915.disable_display=1, with the same result.
Chris, can you please suggest whether any issue in i915 could lead to this behavior?
(In reply to Chris Wilson from comment #3)
> Forced preemption occurs when a lower priority batch does not yield the GPU
> to a higher preemption request [which is a denial-of-service from one and a
> quality-of-service guarantee for the other], over a certain timeout.
What is higher pre-emption request in this case?
(Maybe dmesg output could identify also the higher priority process?)
(In reply to Chris Wilson from comment #1)
> It's a forced preemption; reset from softirq context prohibits error capture
> and would further compromise the QoS of the more important task.
Is there some option to enable error capture also in that case?
(Otherwise debugging those issues would become really hard...)
(In reply to Eero Tamminen from comment #5)
> (In reply to Chris Wilson from comment #3)
> > Forced preemption occurs when a lower priority batch does not yield the GPU
> > to a higher preemption request [which is a denial-of-service from one and a
> > quality-of-service guarantee for the other], over a certain timeout.
> What is higher pre-emption request in this case?
> (Maybe dmesg output could identify also the higher priority process?)
It's not the one at fault, it can be anything including the heartbeat.
> (In reply to Chris Wilson from comment #1)
> > It's a forced preemption; reset from softirq context prohibits error capture
> > and would further compromise the QoS of the more important task.
> Is there some option to enable error capture also in that case?
The option is to turn off this mode. The means to do that are likely to change when we add more knobs other than modparams.
> (Otherwise debugging those issues would become really hard...)
Is anything else needed from the submission side to debug the issue on the i915 side?
The issue can be reproduced on both the CI_DRM_7350 and CI_DRM_7420 kernel builds.
Are media batches still non-preemptible on TGL? But do you expect a single one to run for more than 100ms?
Otherwise I think is trying to relax the timeout with https://patchwork.freedesktop.org/series/69992/ so perhaps give it a spin.
The command line I provided uses the smallest possible video, so batches should execute almost instantaneously, in much less than 100 ms. Actually, I would like to point out two issues here:
1. Media fails on a key use case
2. The i915 driver hides the GPU hang and does not provide a way to get the error dump
The first one is critical for the TGL program.
The second one is critical in the longer term: how are we supposed to debug hangs? I strongly suggest either changing the behavior to allow dumps or introducing a dedicated i915 module parameter that allows them again. In the case of a parameter, I would suggest printing a message to the dmesg log advising the user to rerun the workload with this parameter and capture the dump.
Can you please try with i915.reset=2?
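Before rerunning, it can help to confirm which value the loaded module actually picked up. A minimal Python sketch, assuming the parameter is exposed under the generic /sys/module/<module>/parameters/<name> layout; whether i915 exposes `reset` there, and with which permissions, depends on the kernel build, and `read_module_param` is a made-up helper name:

```python
from pathlib import Path

# Assumed location, following the generic sysfs layout for module parameters.
# Reading it may require root, and the file may not exist on every kernel.
I915_RESET_PARAM = Path("/sys/module/i915/parameters/reset")


def read_module_param(path=I915_RESET_PARAM):
    """Return the parameter value as a stripped string, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        # Covers a missing file as well as insufficient permissions.
        return None
```

A return value of None means only that the file could not be read, not that the parameter is unset.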
This appears to be either a regression in drm-tip or a change that should be reflected in user space as well. Git bisect points to the following commit; reverting it makes the use case pass. @Mika: can you please comment on why the change could break media usage?
Author: Mika Kuoppala <email@example.com>
Date: Tue Oct 15 18:44:49 2019 +0300
Avoid possible cs hang with semaphores by disabling
Signed-off-by: Mika Kuoppala <firstname.lastname@example.org>
Reviewed-by: Chris Wilson <email@example.com>
Signed-off-by: Chris Wilson <firstname.lastname@example.org>
I am attaching:
1. Git bisect log
2. i915 error state dumped from kernel @904ce198 (this kernel does not have the preemption timeout)
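For readers unfamiliar with bisection: conceptually it is a binary search over the commit range for the first commit that fails. A minimal Python sketch of that idea, not of how git implements it; it assumes a linear history with a single good-to-bad transition, and `first_bad_commit` and the `is_bad` predicate are hypothetical names:

```python
def first_bad_commit(commits, is_bad):
    """Return the first commit for which is_bad() is True.

    `commits` is ordered oldest to newest; commits[0] is assumed good and
    commits[-1] is assumed bad, with a single good-to-bad transition.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return commits[bad]
```

Here `is_bad` would stand for building the kernel at that commit and checking whether the transcode command hangs.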
Created attachment 146030
Created attachment 146031
Created attachment 146032
Setting the priority of this bug to highest considering it's a regression.
-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.
You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/643.