Created attachment 130550
dmesg-netconsole

[ 218.412919] drm/i915: Resetting chip after gpu hang
[ 218.413719] drm/i915: Resetting chip after gpu hang
[ 218.414783] drm/i915: Resetting chip after gpu hang
[ 218.415540] drm/i915: Resetting chip after gpu hang
[ 218.416107] drm/i915: Resetting chip after gpu hang
[ 218.416607] drm/i915: Resetting chip after gpu hang
[ 218.417208] drm/i915: Resetting chip after gpu hang
[ 218.417728] drm/i915: Resetting chip after gpu hang
[ 218.418297] drm/i915: Resetting chip after gpu hang
[ 218.421424] drm/i915: Resetting chip after gpu hang
[ 218.422044] drm/i915: Resetting chip after gpu hang
[ 218.422572] drm/i915: Resetting chip after gpu hang
[ 218.423130] drm/i915: Resetting chip after gpu hang
[ 218.423638] drm/i915: Resetting chip after gpu hang
[ 219.429598] Failed to start request f
[ 254.210145] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [migration/3:34]
[ 254.210155] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [migration/1:18]
[ 254.210165] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s!
That looks strongly like we took the CPU down with a GPU reset.
Might as well ask: how reproducible is it? And what are the system details?
Well, it was very much reproducible, but I'm no longer seeing it on the latest tip. Do you want a bisect, or can I just close?
Hmm, bisect and poke Mika -- I wonder if this relates to the system hangs he is investigating for hsw-4770r
Okay, so bisection pointed to:

commit e09a3036412a959689bacf017bf2cbc226c9fea4
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Apr 11 11:13:39 2017 +0100

    drm/i915: Use __intel_uncore_wait_for_register_fw for sandybride_pcode_read

    Since the sandybridge_pcode_read() may be called from
    skl_pcode_request() inside an atomic context (with preempt disabled), we
    should avoid hitting any sleeping paths. Currently is being called with
    a 500ms timeout, irrespective of being inside an atomic context or not.
    This is reduced down to 500us to play nice with the atomic context, and
    that appears to be sufficient to keep BAT happy (we have a DRM_ERROR
    should it timeout), i.e. we do not see any 500us pcode timeouts for
    normal use. So leave it as a pure spin without having to introduce new
    code paths to separate atomic/normal contexts.
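For anyone reading along who is unfamiliar with the pattern the commit message describes (a short, bounded busy-wait in place of a long wait that may sleep), here is a rough userspace sketch in Python. It is not the i915 code and does not reproduce any kernel API. The names `wait_for_register`, `FakeMailbox` and `BUSY` are hypothetical and exist only for illustration; only the 500us budget comes from the commit message.

```python
import time

BUSY = 1 << 31  # hypothetical "busy" bit, chosen only for this illustration


def wait_for_register(read_reg, mask, value, timeout_us):
    """Spin until (read_reg() & mask) == value or timeout_us has elapsed.

    The loop never sleeps or yields, which is the property that matters
    when the caller must not sleep. Returns True on success, False on timeout.
    """
    deadline = time.monotonic_ns() + timeout_us * 1_000
    while True:
        if (read_reg() & mask) == value:
            return True
        if time.monotonic_ns() >= deadline:
            # One final read, so an update landing right at the deadline
            # is not reported as a timeout.
            return (read_reg() & mask) == value


class FakeMailbox:
    """Stand-in for a hardware register: reports BUSY for a fixed number of reads."""

    def __init__(self, busy_reads):
        self.remaining = busy_reads

    def read(self):
        if self.remaining > 0:
            self.remaining -= 1
            return BUSY
        return 0


if __name__ == "__main__":
    quick = FakeMailbox(busy_reads=3)
    stuck = FakeMailbox(busy_reads=10**12)
    print(wait_for_register(quick.read, BUSY, 0, timeout_us=500))
    print(wait_for_register(stuck.read, BUSY, 0, timeout_us=500))
```

The trade-off is the one the commit message states: a pure spin works in both atomic and normal contexts without separate code paths, at the cost of burning CPU for up to the timeout, which is why the budget is kept to microseconds rather than milliseconds.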
(In reply to mwa from comment #5)
> Okay, so bisection pointed to:
>
> commit e09a3036412a959689bacf017bf2cbc226c9fea4
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Tue Apr 11 11:13:39 2017 +0100
>
> drm/i915: Use __intel_uncore_wait_for_register_fw for sandybride_pcode_read

That's what masked/fixed the live_hangcheck failure for you?
(In reply to Chris Wilson from comment #6)
> (In reply to mwa from comment #5)
> > Okay, so bisection pointed to:
> >
> > commit e09a3036412a959689bacf017bf2cbc226c9fea4
> > Author: Chris Wilson <chris@chris-wilson.co.uk>
> > Date: Tue Apr 11 11:13:39 2017 +0100
> >
> > drm/i915: Use __intel_uncore_wait_for_register_fw for sandybride_pcode_read
>
> That's what masked/fixed the live_hangcheck failure for you?

Yes.
Hello, this seems to be resolved. Is there any other relevant information on this case? Thank you.
Changing the status to CLOSED. Thank you.