Bug 85885 - [HSW]igt/kms_pipe_crc_basic/hang-read-crc-pipe-B causes system hang
Status: CLOSED INVALID
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: medium major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2014-11-05 02:27 UTC by lu hua
Modified: 2015-05-13 07:17 UTC

Attachments
dmesg (33.57 KB, text/plain)
2014-11-05 02:27 UTC, lu hua
dmesg(CONFIG_DRM_I915_FBDEV=n) (23.66 KB, text/plain)
2014-11-24 04:58 UTC, lu hua

Description lu hua 2014-11-05 02:27:16 UTC
Created attachment 108923
dmesg

==System Environment==
--------------------------
Regression: not sure 
Non-working platforms: HSW

==kernel==
--------------------------
drm-intel-nightly/782bafb46cc12737b16e5007583bd7b534c6202a

==Bug detailed description==
The test causes a system hang. It happens on only one HSW machine (same as bug 85541 and bug 85787).
Both the -nightly and -fixes kernels have this issue.

output:
IGT-Version: 1.8-ge622850 (x86_64) (Linux: 3.18.0-rc3_drm-intel-nightly_782baf_20141104_debug+ x86_64)
hang-read-crc-pipe-B: Testing connector VGA-1 using pipe B

dmesg:
[  176.520445] Kernel panic - not syncing: Timeout synchronizing machine check over CPUs
[  177.551604] Shutting down cpus with NMI
[  177.562777] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[  177.563666] drm_kms_helper: panic occurred, switching back to text console
[  177.564624]
[  177.565511] =============================================
[  177.566389] [ INFO: possible recursive locking detected ]
[  177.567268] 3.18.0-rc3_drm-intel-nightly_782baf_20141104_debug+ #1164 Not tainted
[  177.568162] ---------------------------------------------
[  177.569045] kms_pipe_crc_ba/4247 is trying to acquire lock:
[  177.569922]  (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa002563a>] __drm_modeset_lock_all+0x6c/0x100 [drm]
[  177.570937]
[  177.570937] but task is already holding lock:
[  177.572671]  (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa002563a>] __drm_modeset_lock_all+0x6c/0x100 [drm]
[  177.573707]
[  177.573707] other info that might help us debug this:
[  177.575478]  Possible unsafe locking scenario:
[  177.575478]
[  177.577206]        CPU0
[  177.578064]        ----
[  177.578982]   lock(&dev->mode_config.mutex);
[  177.579880]   lock(&dev->mode_config.mutex);
[  177.580761]
[  177.580761]  *** DEADLOCK ***
[  177.580761]
[  177.583085]  May be due to missing lock nesting notation
[  177.583085]
[  177.584602] 5 locks held by kms_pipe_crc_ba/4247:
[  177.585370]  #0:  (&dev->mode_config.mutex){+.+.+.}, at: [<ffffffffa002563a>] __drm_modeset_lock_all+0x6c/0x100 [drm]
[  177.586305]  #1:  (crtc_ww_class_acquire){+.+.+.}, at: [<ffffffffa0025644>] __drm_modeset_lock_all+0x76/0x100 [drm]
[  177.587243]  #2:  (crtc_ww_class_mutex){+.+.+.}, at: [<ffffffffa0024f82>] drm_modeset_lock+0x5c/0xbc [drm]
[  177.588175]  #3:  (&(&dev_priv->uncore.lock)->rlock){-.-.+.}, at: [<ffffffffa00d3c16>] hsw_write32+0x90/0x124 [i915]
[  177.589124]  #4:  (panic_lock){....+.}, at: [<ffffffff8183635e>] panic+0x3d/0x1f5
[  177.590055]
[  177.590055] stack backtrace:
[  177.591583] CPU: 7 PID: 4247 Comm: kms_pipe_crc_ba Not tainted 3.18.0-rc3_drm-intel-nightly_782baf_20141104_debug+ #1164
[  177.592394] Hardware name:                  /DZ87KLT75K, BIOS KLZ8711D.86A.0336.2013.0516.1957 05/16/2013
[  177.593214]  ffffffff83eb7cd0 ffff88025fbca878 ffffffff8183ae58 0000000000000000
[  177.594136]  ffffffff83eb7cd0 ffff88025fbca948 ffffffff81074813 ffff88025fbca990
[  177.595041]  ffff88025340c000 0000000183e78bb0 ffff880200000000 28414289aca22785
[  177.595939] Call Trace:
[  177.596721]  <#MC>  [<ffffffff8183ae58>] dump_stack+0x46/0x58
[  177.597537]  [<ffffffff81074813>] __lock_acquire+0x8b2/0x1803
[  177.598326]  [<ffffffff81138dc5>] ? create_object+0x17c/0x291
[  177.599108]  [<ffffffff81075c74>] lock_acquire+0xd3/0x10d
[  177.599975]  [<ffffffffa002563a>] ? __drm_modeset_lock_all+0x6c/0x100 [drm]
[  177.600749]  [<ffffffff81071e23>] ? trace_hardirqs_off+0xd/0xf
[  177.601505]  [<ffffffff8183ec91>] mutex_lock_nested+0x4b/0x2d2
[  177.602252]  [<ffffffffa002563a>] ? __drm_modeset_lock_all+0x6c/0x100 [drm]
[  177.602998]  [<ffffffffa002560c>] ? __drm_modeset_lock_all+0x3e/0x100 [drm]
[  177.603717]  [<ffffffff818341e1>] ? kmemleak_alloc+0x25/0x41
[  177.604417]  [<ffffffff81133ed3>] ? kmem_cache_alloc_trace+0xb8/0x13c
[  177.605118]  [<ffffffffa002563a>] __drm_modeset_lock_all+0x6c/0x100 [drm]
[  177.605822]  [<ffffffffa0025724>] drm_modeset_lock_all+0x10/0x28 [drm]
[  177.606522]  [<ffffffffa0070307>] drm_fb_helper_pan_display+0x36/0xc5 [drm_kms_helper]
[  177.607219]  [<ffffffff813da179>] fb_pan_display+0xed/0x131
[  177.607904]  [<ffffffff813d52ec>] bit_update_start+0x20/0x49
[  177.608586]  [<ffffffff813d33f7>] fbcon_switch+0x452/0x469
[  177.609252]  [<ffffffff8142befe>] redraw_screen+0x112/0x1e3
[  177.609900]  [<ffffffff813d2956>] fbcon_blank+0x1e5/0x26e
[  177.610550]  [<ffffffff810902b6>] ? mod_timer+0x12a/0x184
[  177.611180]  [<ffffffff81071e23>] ? trace_hardirqs_off+0xd/0xf
[  177.611796]  [<ffffffff81841f4e>] ? _raw_spin_unlock_irqrestore+0x38/0x46
[  177.612417]  [<ffffffff810902df>] ? mod_timer+0x153/0x184
[  177.613035]  [<ffffffff8142d011>] do_unblank_screen+0xfa/0x173
[  177.613656]  [<ffffffff8142d09a>] unblank_screen+0x10/0x12
[  177.614281]  [<ffffffff8139cfdc>] bust_spinlocks+0x14/0x28
[  177.614911]  [<ffffffff81836429>] panic+0x108/0x1f5
[  177.615541]  [<ffffffff8101bac1>] mce_panic+0x159/0x18b
[  177.616177]  [<ffffffff8101bb39>] mce_timed_out+0x46/0x67
[  177.616812]  [<ffffffff8101befd>] do_machine_check+0x192/0x766
[  177.617462]  [<ffffffffa00d07e1>] ? hsw_unclaimed_reg_detect.isra.6+0x20/0x44 [i915]
[  177.618122]  [<ffffffff818441ee>] machine_check+0x1e/0x30
[  177.618785]  [<ffffffffa00d07e1>] ? hsw_unclaimed_reg_detect.isra.6+0x20/0x44 [i915]
[  177.619451]  <<EOE>>  [<ffffffffa00d3c7f>] hsw_write32+0xf9/0x124 [i915]
[  177.620172]  [<ffffffffa00fa6fc>] hsw_fdi_link_train+0xf6/0x34a [i915]
[  177.620844]  [<ffffffffa00e5f8a>] haswell_crtc_enable+0x4a4/0x8f5 [i915]
[  177.621496]  [<ffffffff810764aa>] ? trace_hardirqs_on+0xd/0xf
[  177.622142]  [<ffffffffa00e7bfa>] __intel_set_mode+0x12f4/0x1426 [i915]
[  177.622779]  [<ffffffffa00e9ffe>] intel_set_mode+0x16/0x2f [i915]
[  177.623397]  [<ffffffffa00eac81>] intel_crtc_set_config+0x77c/0xae0 [i915]
[  177.624013]  [<ffffffffa0018d34>] drm_mode_set_config_internal+0x57/0xe4 [drm]
[  177.624637]  [<ffffffffa001ca8e>] drm_mode_setcrtc+0x3ef/0x499 [drm]
[  177.625239]  [<ffffffffa0010c29>] drm_ioctl+0x2be/0x423 [drm]
[  177.625824]  [<ffffffffa001c69f>] ? drm_mode_setplane+0x1d9/0x1d9 [drm]
[  177.626392]  [<ffffffff81076441>] ? trace_hardirqs_on_caller+0x142/0x19e
[  177.626962]  [<ffffffff8114c559>] do_vfs_ioctl+0x455/0x49f
[  177.627526]  [<ffffffff810bca34>] ? __audit_syscall_entry+0xbf/0xe1
[  177.628087]  [<ffffffff8100d3b0>] ? do_audit_syscall_entry+0x63/0x65
[  177.628645]  [<ffffffff8114c5f6>] SyS_ioctl+0x53/0x81
[  177.629198]  [<ffffffff81842552>] system_call_fastpath+0x12/0x17

==Reproduce steps==
---------------------------- 
1. ./kms_pipe_crc_basic --run-subtest hang-read-crc-pipe-B
Comment 1 lu hua 2014-11-13 02:13:17 UTC
Running ./kms_flip --run-subtest flip-vs-panning also causes a system hang and produces the same call trace.
IGT-Version: 1.8-g50d539e (x86_64) (Linux: 3.18.0-rc3_drm-intel-nightly_9a7620_20141112_debug+ x86_64)
Using monotonic timestamps
Beginning flip-vs-panning on crtc 8, connector 18
  1024x768 60 1024 1048 1184 1344 768 771 777 806 0xa 0x40 65000
Comment 2 lu hua 2014-11-14 03:18:07 UTC
Running ./kms_pipe_crc_basic --run-subtest hang-read-crc-pipe-C also hangs the system, with a similar dmesg.
Comment 3 Daniel Vetter 2014-11-18 12:41:14 UTC
Oh fuck, fbdev panic handling killed the oops here.

/me cries

Please rebuild the kernel with CONFIG_DRM_I915_FBDEV=n (only for this bug here, it will kill the console), reproduce the issue and attach a new dmesg (with debugging, as usual). That way we should be able to capture the oops correctly.
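
For anyone repeating this step, here is a minimal sketch (not part of the original comment) for checking that the rebuilt kernel's .config really has the option turned off. It assumes the standard .config text format, in which a disabled option appears as "# CONFIG_... is not set" or is absent. The helper name kconfig_option_state and the sample text are illustrative only.

```python
import re


def kconfig_option_state(config_text, option):
    """Return the value assigned to CONFIG_<option>, or "n" if it is unset.

    An option that is missing, or that only appears as
    "# CONFIG_<option> is not set", is reported as "n".
    """
    # Anchor on "=" so DRM_I915 does not match DRM_I915_FBDEV.
    assign = re.compile(r"^CONFIG_%s=(.*)$" % re.escape(option))
    for line in config_text.splitlines():
        match = assign.match(line.strip())
        if match:
            return match.group(1)
    return "n"


# Hypothetical .config excerpt, for illustration only.
sample = """\
CONFIG_DRM_I915=m
# CONFIG_DRM_I915_FBDEV is not set
"""

print(kconfig_option_state(sample, "DRM_I915_FBDEV"))
```

In practice the text would be read from the build tree's .config file before building and booting the test kernel.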
Comment 4 Daniel Vetter 2014-11-18 12:42:22 UTC
Oh and since this looks super-nasty: Please try to figure out whether older stable kernels work properly or whether there's a different behaviour. I'm pretty sure that this worked once and is a regression.
Comment 5 lu hua 2014-11-24 04:58:47 UTC
Created attachment 109918
dmesg(CONFIG_DRM_I915_FBDEV=n)
Comment 6 Daniel Vetter 2014-11-24 16:06:40 UTC
[ 4483.626048] Kernel panic - not syncing: Timeout synchronizing machine check over CPUs
[ 4484.664475] Shutting down cpus with NMI

That's a fatal MCE, which is either busted hw or a broken BIOS. It's not pretty that the i915 panic handler then kills the box, but fixing that is a much larger problem (and atm not at the top).

The underlying MCE issue which kills the machine, otoh, is a plain hw issue, so closing as invalid. I guess you need to decommission/replace this machine if a BIOS upgrade doesn't fix this.
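
The diagnosis rests on ordering: in the original dmesg the machine-check panic is logged first, and the lockdep report only shows up afterwards, from inside the panic path. A minimal sketch, assuming the dmesg is available as text, that reports which of the two events appears first. The function name and the patterns are illustrative, not an existing tool.

```python
import re

# Patterns taken from the log lines quoted in this report.
MCE_PANIC = re.compile(r"Kernel panic - not syncing: .*machine check", re.IGNORECASE)
LOCKDEP = re.compile(r"INFO: possible recursive locking detected")


def first_fatal_event(dmesg_text):
    """Return (kind, line) for the earliest matching line, or None.

    kind is "mce_panic" or "lockdep". Lines are scanned in log order,
    so the result tells which event was reported first.
    """
    for line in dmesg_text.splitlines():
        if MCE_PANIC.search(line):
            return ("mce_panic", line.strip())
        if LOCKDEP.search(line):
            return ("lockdep", line.strip())
    return None


# Three lines copied from the dmesg in the description.
sample = """\
[  176.520445] Kernel panic - not syncing: Timeout synchronizing machine check over CPUs
[  177.551604] Shutting down cpus with NMI
[  177.566389] [ INFO: possible recursive locking detected ]
"""

kind, line = first_fatal_event(sample)
print(kind, "->", line)
```

If the first event is the machine-check panic, the locking report that follows is a side effect of the panic handling rather than the root cause.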

