Bug 100457 - [HSW] live_hangcheck - Failed to start request
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2017-03-29 20:16 UTC by mwa
Modified: 2017-06-30 20:36 UTC
i915 platform: HSW
i915 features: GEM/Other


Attachments
dmesg-netconsole (160.39 KB, text/plain)
2017-03-29 20:16 UTC, mwa

Description mwa 2017-03-29 20:16:44 UTC
Created attachment 130550
dmesg-netconsole

[  218.412919] drm/i915: Resetting chip after gpu hang
[  218.413719] drm/i915: Resetting chip after gpu hang
[  218.414783] drm/i915: Resetting chip after gpu hang
[  218.415540] drm/i915: Resetting chip after gpu hang
[  218.416107] drm/i915: Resetting chip after gpu hang
[  218.416607] drm/i915: Resetting chip after gpu hang
[  218.417208] drm/i915: Resetting chip after gpu hang
[  218.417728] drm/i915: Resetting chip after gpu hang
[  218.418297] drm/i915: Resetting chip after gpu hang
[  218.421424] drm/i915: Resetting chip after gpu hang
[  218.422044] drm/i915: Resetting chip after gpu hang
[  218.422572] drm/i915: Resetting chip after gpu hang
[  218.423130] drm/i915: Resetting chip after gpu hang
[  218.423638] drm/i915: Resetting chip after gpu hang
[  219.429598] Failed to start request f
[  254.210145] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [migration/3:34]
[  254.210155] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:18]
[  254.210165] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s!
Comment 1 Chris Wilson 2017-03-29 20:51:14 UTC
That looks strongly like we took the CPU down with a GPU reset.
Comment 2 Chris Wilson 2017-03-31 01:03:37 UTC
Might as well ask: how reproducible is it? And what are the system details?
Comment 3 mwa 2017-04-11 14:02:37 UTC
Well, it was very much reproducible, but I'm no longer seeing it on the latest tip. Do you want a bisect, or can I just close?
Comment 4 Chris Wilson 2017-04-11 14:22:17 UTC
Hmm, bisect and poke Mika -- I wonder if this relates to the system hangs he is investigating for hsw-4770r
Comment 5 mwa 2017-04-12 20:49:57 UTC
Okay, so bisection pointed to:

commit e09a3036412a959689bacf017bf2cbc226c9fea4
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 11 11:13:39 2017 +0100

    drm/i915: Use __intel_uncore_wait_for_register_fw for sandybride_pcode_read
    
    Since the sandybridge_pcode_read() may be called from
    skl_pcode_request() inside an atomic context (with preempt disabled), we
    should avoid hitting any sleeping paths. Currently is being called with
    a 500ms timeout, irrespective of being inside an atomic context or not.
    This is reduced down to 500us to play nice with the atomic context, and
    that appears to be sufficient to keep BAT happy (we have a DRM_ERROR
    should it timeout), i.e. we do not see any 500us pcode timeouts for
    normal use. So leave it as a pure spin without having to introduce new
    code paths to separate atomic/normal contexts.
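
The "pure spin" described in the commit message above can be illustrated with a small userspace C sketch. This is not the i915 code and does not reproduce __intel_uncore_wait_for_register_fw; the names spin_wait_for_bits, fake_reg and fake_clock_us are hypothetical, and the clock is a fake that advances 10us per read so the example stays deterministic. It only shows the general shape of a bounded busy-wait that never sleeps, which is what makes such a wait usable where sleeping is not allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a memory-mapped hardware register. */
static volatile uint32_t fake_reg;

/* Hypothetical fake clock: every read advances time by 10us. */
static uint64_t fake_time_us;
static uint64_t fake_clock_us(void)
{
    fake_time_us += 10;
    return fake_time_us;
}

typedef uint64_t (*clock_us_fn)(void);

/*
 * Busy-poll until (*reg & mask) == value or timeout_us has elapsed
 * according to now_us(). The loop never sleeps or yields; it simply
 * re-reads the register, so the caller burns CPU for at most roughly
 * timeout_us. Returns true on success, false on timeout.
 */
static bool spin_wait_for_bits(volatile uint32_t *reg, uint32_t mask,
                               uint32_t value, uint64_t timeout_us,
                               clock_us_fn now_us)
{
    assert(reg != NULL && now_us != NULL);

    const uint64_t deadline = now_us() + timeout_us;

    for (;;) {
        if ((*reg & mask) == value)
            return true;
        if (now_us() >= deadline)
            break;
    }

    /* One last read so a transition right at the deadline still counts. */
    return (*reg & mask) == value;
}
```

With the fake clock, a call such as spin_wait_for_bits(&fake_reg, 0x1, 0x1, 500, fake_clock_us) returns immediately when the bit is already set and gives up after about 500 fake microseconds when it never is. The trade-off is the one the commit message names: a short spin is acceptable, whereas a 500ms budget would be far too long to hold a CPU without sleeping.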
Comment 6 Chris Wilson 2017-04-12 21:14:18 UTC
(In reply to mwa from comment #5)
> Okay, so bisection pointed to:
> 
> commit e09a3036412a959689bacf017bf2cbc226c9fea4
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue Apr 11 11:13:39 2017 +0100
> 
>     drm/i915: Use __intel_uncore_wait_for_register_fw for sandybride_pcode_read

That's what masked/fixed the live_hangcheck failure for you?
Comment 7 mwa 2017-04-13 10:07:59 UTC
(In reply to Chris Wilson from comment #6)
> That's what masked/fixed the live_hangcheck failure for you?

Yes.
Comment 8 Elizabeth 2017-06-23 21:32:59 UTC
Hello,
This seems to be resolved. Is there any other relevant information on this case? Thank you.
Comment 9 Elizabeth 2017-06-30 20:36:20 UTC
Changing the status to CLOSED. Thank you.