Summary: | [TGL] media VME encoding GPU hang w/o i915 error state captured | |
---|---|---|---
Product: | DRI | Reporter: | Dmitry Rogozhkin <dmitry.v.rogozhkin>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | RESOLVED MOVED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | major | |
Priority: | highest | CC: | chris, eero.t.tamminen, francesco.balestrieri, intel-gfx-bugs, james.ausmus, mika.kuoppala, tony.ye, tvrtko.ursulin
Version: | XOrg git | |
Hardware: | Other | |
OS: | All | |
Whiteboard: | Triaged, ReadyForDev | |
i915 platform: | TGL | i915 features: | GPU hang
Description
Dmitry Rogozhkin
2019-11-22 23:31:23 UTC
It's a forced preemption; reset from softirq context prohibits error capture and would further compromise the QoS of the more important task.

Would you mind explaining what a forced preemption is, and suggesting why one happened here? Is it possible to disable it so that the error state can still be captured? Is it possible to force capturing the error state?

Forced preemption occurs when a lower priority batch does not yield the GPU to a higher preemption request [which is a denial-of-service from one and a quality-of-service guarantee for the other], over a certain timeout. See DRM_I915_PREEMPT_TIMEOUT. The plan is to have that property on the engine, but that is waiting for you to ack, so currently the override is i915.reset=2.

>> Forced preemption occurs when a lower priority batch
Thank you for the clarification. However, this does not make sense for my use case. I believe that this was the only workload running on the system. I even tried with i915.disable_display=1, with the same result.
Chris, can you, please, suggest whether any issue in i915 can lead to this behavior?
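
The rule described above can be sketched as a toy model. This is only an illustration of the stated behaviour (a higher priority request waits for the running batch to yield, and the engine is reset once a timeout expires), not the i915 implementation. `Request`, `preemption_outcome`, the outcome labels, the priority numbers and the 100 ms value are all hypothetical, chosen for the example.

```python
from dataclasses import dataclass

# Assumed timeout in milliseconds, for illustration only; on a real kernel the
# value is whatever DRM_I915_PREEMPT_TIMEOUT was set to at build time.
ASSUMED_PREEMPT_TIMEOUT_MS = 100


@dataclass
class Request:
    """Hypothetical stand-in for a GPU request; not an i915 structure."""
    name: str
    priority: int


def preemption_outcome(running, waiting, waited_ms,
                       timeout_ms=ASSUMED_PREEMPT_TIMEOUT_MS):
    """Toy decision rule for what happens to the running request."""
    if waiting.priority <= running.priority:
        return "no-preemption"   # nothing more important is waiting
    if waited_ms < timeout_ms:
        return "wait"            # the running batch may still yield on its own
    return "forced-reset"        # timeout expired: reset, and no error capture


if __name__ == "__main__":
    media = Request("media-batch", priority=0)
    heartbeat = Request("heartbeat", priority=1)
    for waited in (10, 99, 100, 250):
        print(waited, preemption_outcome(media, heartbeat, waited))
```

In this model the waiting request does not have to come from a second application, which is why a hang of this kind can show up even when only one workload is running.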

(In reply to Chris Wilson from comment #3)
> Forced preemption occurs when a lower priority batch does not yield the GPU
> to a higher preemption request [which is a denial-of-service from one and a
> quality-of-service guarantee for the other], over a certain timeout.
What is higher pre-emption request in this case? (Maybe dmesg output could identify also the higher priority process?)
(In reply to Chris Wilson from comment #1)
> It's a forced preemption; reset from softirq context prohibits error capture
> and would further compromise the QoS of the more important task.
Is there some option to enable error capture also in that case? (Otherwise debugging those issues would become really hard...)

(In reply to Eero Tamminen from comment #5)
> (In reply to Chris Wilson from comment #3)
> > Forced preemption occurs when a lower priority batch does not yield the GPU
> > to a higher preemption request [which is a denial-of-service from one and a
> > quality-of-service guarantee for the other], over a certain timeout.
>
> What is higher pre-emption request in this case?
priority.
> (Maybe dmesg output could identify also the higher priority process?)
It's not the one at fault, it can be anything including the heartbeat.
> (In reply to Chris Wilson from comment #1)
> > It's a forced preemption; reset from softirq context prohibits error capture
> > and would further compromise the QoS of the more important task.
>
> Is there some option to enable error capture also in that case?
The option is to turn off this mode. The means to do that are likely to change when we add more knobs other than modparams.
> (Otherwise debugging those issues would become really hard...)

Is anything else needed from submission side to debug the issue on i915 side? Issue can be reproduced both on CI_DRM_7350 and CI_DRM_7420 kernel builds.

Are media batches still non-preemptible on TGL? But do you expect a single one to run for more than 100ms?
Otherwise I think is trying to relax the timeout with https://patchwork.freedesktop.org/series/69992/ so perhaps give it a spin.

The command line I provided uses the smallest possible video. So, batches should be executed instantaneously, in much less than 100ms. Actually, I would like to point out 2 issues here:
1. Media fails on the key use case
2. i915 driver hides GPU hang and does not provide a way to get the error dump
The first one is critical for the TGL program. The second one is critical in the longer term: how are we supposed to debug hangs? I strongly suggest either changing/fixing the behavior to allow dumps, or introducing a special i915 module parameter to allow dumps again. In case of a parameter, I would suggest printing a message into the dmesg log suggesting that the user rerun the workload with this parameter and capture the dump.

Can you please try with i915.reset=2?

This appears to be a regression in drm-tip, or a change which should be reflected in user space as well. Git bisect points to the following commit. Reverting it makes the use case pass. @Mika: can you, please, comment why the change could break media usage?
commit 08fff7aeddc9dd72161b4c8fc27fbab12b4b9352
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date: Tue Oct 15 18:44:49 2019 +0300
    drm/i915/tgl: Wa_1607138340
    Avoid possible cs hang with semaphores by disabling lite restore.
    Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191015154449.10338-11-mika.kuoppala@linux.intel.com
I am attaching:
1. Git bisect log
2. i915 error state dumped from kernel @904ce198 (this kernel does not have preemption timeout)

Created attachment 146030 [details]
drm-tip bisect
Created attachment 146031 [details]
i915_error_state_904ce198
Created attachment 146032 [details]
dmesg_904ce198
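
When comparing kernels like this, it can help to check in a script whether an error state was captured at all before trying to attach it. The following is a minimal sketch, assuming the error state is exposed at /sys/class/drm/card0/error (the card index may differ, and reading it may need root) and that an empty capture reads as a line beginning with "No error state". Both points are assumptions to verify on the system under test, and `has_error_state` is a hypothetical helper, not part of any i915 tooling.

```python
from pathlib import Path

# Assumed location of the i915 error state; adjust the card index as needed.
DEFAULT_ERROR_STATE = Path("/sys/class/drm/card0/error")


def has_error_state(path=DEFAULT_ERROR_STATE):
    """Return True if the file at `path` appears to hold a captured dump."""
    try:
        text = Path(path).read_text(errors="replace")
    except OSError:
        # Missing file or insufficient permissions: treat as "nothing captured".
        return False
    stripped = text.strip()
    # Assumption: an empty capture is reported as "No error state ...".
    return bool(stripped) and not stripped.lower().startswith("no error state")


if __name__ == "__main__":
    if has_error_state():
        print("error state present; save a copy before it is cleared")
    else:
        print("no error state captured (or file not readable)")
```

Taking the path as a parameter keeps the check usable on a saved copy of the dump as well as on the live sysfs file.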

Setting the priority of this bug to highest considering it's a regression.

I have read the report about the VME encoding GPU you have shared. I think it is very difficult to fix the error. My system is also facing the issue. I tried many ways to solve it. https://organizetechnologies.com/ But it doesn't work.

-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/643.