|Summary:||[TGL] media VME encoding GPU hang w/o i915 error state captured|
|Product:||DRI||Reporter:||Dmitry Rogozhkin <dmitry.v.rogozhkin>|
|Component:||DRM/Intel||Assignee:||Intel GFX Bugs mailing list <intel-gfx-bugs>|
|Status:||RESOLVED MOVED||QA Contact:||Intel GFX Bugs mailing list <intel-gfx-bugs>|
|Priority:||highest||CC:||chris, eero.t.tamminen, francesco.balestrieri, intel-gfx-bugs, james.ausmus, mika.kuoppala, tony.ye, tvrtko.ursulin|
|i915 platform:||TGL||i915 features:||GPU hang|
Description Dmitry Rogozhkin 2019-11-22 23:31:23 UTC
Created attachment 146016 [details]
Full dmesg log

From https://github.com/intel/media-driver/issues/773

Run:
wget https://fate-suite.libav.org/h264-conformance/AUD_MW_E.264
sample_multi_transcode -i::h264 AUD_MW_E.264 -hw -async 1 -u 4 -o::h264 a.h264

Stack:
* https://github.com/intel/gmmlib/commit/f78be970a6c3aef6d0347159f9c3f250421af16c
* https://github.com/intel/media-driver/commit/1645d0f06597599393625168cc4445b0ae092219
* https://github.com/Intel-Media-SDK/MediaSDK/commit/2515d8fbb65979685ce086aa5c9b24786cc2cab6 + apply https://github.com/Intel-Media-SDK/MediaSDK/pull/1771

Ran with latest drm-tip kernel:
commit 5bbbc0061acc528705e593d7e01c4c9c40b208db
Merge: e67c139 883d955
Author: Lyude Paul <firstname.lastname@example.org>
Date: Fri Nov 22 14:12:54 2019 -0500

    Merge remote-tracking branch 'drm-intel/topic/core-for-CI' into drm-tip

Essential part of dmesg (the full log, captured with drm.debug=0x1e, is attached):
[ 77.006161] i915 0000:00:02.0: Resetting rcs0 for preemption time out
[ 77.006204] i915 0000:00:02.0: sample_multi_tr context reset due to GPU hang
[ 77.006293] [drm:__i915_request_reset [i915]] client sample_multi_tr: gained 1 ban score, now 1

i915 error state: EMPTY

Can you please help debug the issue from the KMD standpoint? Why is the i915 error state missing? Is it a real GPU hang or an i915 bug?

Note: the above media encoder works on the RCS and VCS rings; VCS tasks depend on RCS ones.
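For anyone triaging a similar log, the reset lines quoted above can be picked out mechanically. The following is a minimal sketch, not part of the original report: the function name `find_engine_resets` and the regular expression are assumptions modelled only on the three dmesg lines shown in this description, so other kernels may format these messages differently.

```python
import re

# Assumed line shape, taken from the report:
# "[ 77.006161] i915 0000:00:02.0: Resetting rcs0 for preemption time out"
RESET_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+i915 \S+ Resetting (?P<engine>\w+) for (?P<reason>.+)$"
)


def find_engine_resets(dmesg_text):
    """Return a list of (timestamp, engine, reason) for each engine-reset line."""
    resets = []
    for line in dmesg_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            resets.append(
                (float(match["ts"]), match["engine"], match["reason"].strip())
            )
    return resets


if __name__ == "__main__":
    sample = (
        "[ 77.006161] i915 0000:00:02.0: Resetting rcs0 for preemption time out\n"
        "[ 77.006204] i915 0000:00:02.0: sample_multi_tr context reset due to GPU hang\n"
    )
    for ts, engine, reason in find_engine_resets(sample):
        print(f"{ts:.6f}s: engine {engine} reset, reason: {reason}")
```

A reason containing "preemption" would indicate the reset path discussed in the comments below, as opposed to a hang detected by other means.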
Comment 1 Chris Wilson 2019-11-22 23:43:02 UTC
It's a forced preemption; reset from softirq context prohibits error capture and would further compromise the QoS of the more important task.
Comment 2 Dmitry Rogozhkin 2019-11-22 23:55:30 UTC
Would you mind explaining what a forced preemption is and suggesting why it happened? Is it possible to disable it so that the error state can still be captured? Is it possible to force capture of the error state?
Comment 3 Chris Wilson 2019-11-23 10:21:18 UTC
Forced preemption occurs when a lower priority batch does not yield the GPU to a higher preemption request [which is a denial-of-service from one and a quality-of-service guarantee for the other], over a certain timeout. See DRM_I915_PREEMPT_TIMEOUT. The plan is to have that property on the engine, but that is waiting for you to ack, so currently the override is i915.reset=2.
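To confirm which value of the `reset` module parameter a running system is actually using, something like the sketch below could help. It assumes the parameter is exposed in the usual place for module parameters, `/sys/module/i915/parameters/reset`, and that reading it may require root; the helper name `read_i915_reset_mode` and the `sysfs_root` argument are hypothetical. The meaning of each value is not spelled out in this thread, so the sketch only reports the number.

```python
from pathlib import Path


def read_i915_reset_mode(sysfs_root="/sys"):
    """Return the current i915 'reset' module parameter as an int.

    Returns None if the file is missing, unreadable, or not an integer
    (for example, when i915 is not loaded or the caller lacks permission).
    """
    path = Path(sysfs_root) / "module" / "i915" / "parameters" / "reset"
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    mode = read_i915_reset_mode()
    if mode is None:
        print("i915 reset parameter not available")
    else:
        print(f"i915.reset={mode}")
```

The `sysfs_root` argument exists only so the helper can be exercised against a scratch directory instead of a live system.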
Comment 4 Dmitry Rogozhkin 2019-11-23 15:59:57 UTC
>> Forced preemption occurs when a lower priority batch

Thank you for the clarification. However, this does not make any sense for my use case. I believe that this was the only workload running on the system. I even tried with i915.disable_display=1, with the same result. Chris, can you please suggest whether any issue in i915 could lead to this behavior?
Comment 5 Eero Tamminen 2019-11-25 12:02:05 UTC
(In reply to Chris Wilson from comment #3)
> Forced preemption occurs when a lower priority batch does not yield the GPU
> to a higher preemption request [which is a denial-of-service from one and a
> quality-of-service guarantee for the other], over a certain timeout.

What is higher pre-emption request in this case? (Maybe dmesg output could identify also the higher priority process?)

(In reply to Chris Wilson from comment #1)
> It's a forced preemption; reset from softirq context prohibits error capture
> and would further compromise the QoS of the more important task.

Is there some option to enable error capture also in that case? (Otherwise debugging those issues would become really hard...)
Comment 6 Chris Wilson 2019-11-25 15:33:37 UTC
(In reply to Eero Tamminen from comment #5)
> (In reply to Chris Wilson from comment #3)
> > Forced preemption occurs when a lower priority batch does not yield the GPU
> > to a higher preemption request [which is a denial-of-service from one and a
> > quality-of-service guarantee for the other], over a certain timeout.
>
> What is higher pre-emption request in this case?

priority.

> (Maybe dmesg output could identify also the higher priority process?)

It's not the one at fault, it can be anything including the heartbeat.

> (In reply to Chris Wilson from comment #1)
> > It's a forced preemption; reset from softirq context prohibits error capture
> > and would further compromise the QoS of the more important task.
>
> Is there some option to enable error capture also in that case?

The option is to turn off this mode. The means to do that are likely to change when we add more knobs other than modparams.

> (Otherwise debugging those issues would become really hard...)
Comment 7 Dmitry Rogozhkin 2019-11-25 15:42:34 UTC
Is anything else needed from the submission side to debug the issue on the i915 side?
Comment 8 Dmitry Rogozhkin 2019-11-26 01:53:42 UTC
The issue can be reproduced on both the CI_DRM_7350 and CI_DRM_7420 kernel builds.
Comment 9 Tvrtko Ursulin 2019-11-26 09:30:26 UTC
Are media batches still non-preemptible on TGL? But do you expect a single one to run for more than 100ms? Otherwise I think is trying to relax the timeout with https://patchwork.freedesktop.org/series/69992/ so perhaps give it a spin.
Comment 10 Dmitry Rogozhkin 2019-11-26 15:48:24 UTC
The command line I provided uses the smallest possible video, so batches should execute almost instantaneously, in much less than 100ms. Actually, I would like to point out 2 issues here:
1. Media fails on the key use case
2. The i915 driver hides the GPU hang and does not provide a way to get the error dump

The first one is critical for the TGL program. The second one is critical in the longer term: how are we supposed to debug hangs? I strongly suggest either changing/fixing the behavior to allow dumps or introducing a special i915 module parameter to allow dumps again. In the case of a parameter, I would suggest printing a message to the dmesg log telling the user to rerun the workload with this parameter and capture the dump.
Comment 11 Tvrtko Ursulin 2019-11-26 16:06:51 UTC
Can you please try with i915.reset=2?
Comment 12 Dmitry Rogozhkin 2019-11-27 08:57:19 UTC
This appears to be a regression in drm-tip, or a change which should also be reflected in user space. Git bisect points to the following commit. Reverting it makes the use case pass. @Mika: can you please comment on why the change could break media usage?

commit 08fff7aeddc9dd72161b4c8fc27fbab12b4b9352
Author: Mika Kuoppala <email@example.com>
Date: Tue Oct 15 18:44:49 2019 +0300

    drm/i915/tgl: Wa_1607138340

    Avoid possible cs hang with semaphores by disabling lite restore.

    Signed-off-by: Mika Kuoppala <firstname.lastname@example.org>
    Reviewed-by: Chris Wilson <email@example.com>
    Signed-off-by: Chris Wilson <firstname.lastname@example.org>
    Link: https://email@example.com

I am attaching:
1. Git bisect log
2. i915 error state dumped from kernel @904ce198 (this kernel does not have preemption timeout)
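For readers unfamiliar with how a bisect arrives at a single commit, the idea is a binary search between a known-good and a known-bad revision. The toy sketch below illustrates that idea only; it assumes a linear history, and the commit names and the `is_bad` predicate are hypothetical stand-ins for building each kernel and running the transcode. It is not how the attached bisect log was produced.

```python
def first_bad_commit(commits, is_bad):
    """Binary-search a linear history for the first bad commit.

    `commits` is ordered oldest to newest; commits[0] must be known good
    and commits[-1] known bad. `is_bad(commit)` tests a single commit.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return commits[bad]


if __name__ == "__main__":
    # Hypothetical history in which the regression lands at "c5".
    history = [f"c{i}" for i in range(8)]
    culprit = first_bad_commit(history, lambda c: history.index(c) >= 5)
    print("first bad commit:", culprit)
```

Each step halves the remaining range, so even a long stretch of drm-tip history needs only a handful of build-and-test cycles.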
Comment 13 Dmitry Rogozhkin 2019-11-27 08:58:05 UTC
Created attachment 146030 [details] drm-tip bisect
Comment 14 Dmitry Rogozhkin 2019-11-27 08:58:33 UTC
Created attachment 146031 [details] i915_error_state_904ce198
Comment 15 Dmitry Rogozhkin 2019-11-27 08:59:22 UTC
Created attachment 146032 [details] dmesg_904ce198
Comment 16 Lakshmi 2019-11-28 09:11:19 UTC
Setting the priority of this bug to highest considering it's a regression.
Comment 18 Martin Peres 2019-11-29 19:52:08 UTC
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/643.