| Field | Value |
|---|---|
| Summary | execlists causes machine lockups |
| Product | DRI |
| Component | DRM/Intel |
| Version | DRI git |
| Hardware | Other |
| OS | All |
| Status | CLOSED FIXED |
| Severity | critical |
| Priority | highest |
| Reporter | Chris Wilson <chris> |
| Assignee | Rami <ramix.ben.hassine> |
| QA Contact | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| CC | intel-gfx-bugs |
| i915 platform | BSW/CHT |
| i915 features | GEM/execlists |
| Attachments | Move execlists to bottom-half (patch, attachment 120645, see description) |
Description
Chris Wilson 2015-12-21 14:11:52 UTC

Created attachment 120645 [details] [review]
Move execlists to bottom-half

Still searching for the machine hang again (typical!). The theory for the unstable clocksource is that execlist submission is stalling interrupts for too long; yes, execlists do stall the clock-update interrupt.

[18317.676426] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
[18317.676478] clocksource: 'refined-jiffies' wd_now: 1001b7d1e wd_last: 1001b7cea mask: ffffffff
[18317.676492] clocksource: 'tsc' cs_now: 1aae45992d80 cs_last: 1aae0d54eae0 mask: ffffffffffffffff
[18317.697834] clocksource: Switched to clocksource refined-jiffies
[18485.379989] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[18485.380018] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G U 4.4.0-rc6+ #363
[18485.380037] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
[18485.380052] task: ffff88027698e680 ti: ffff8802769ec000 task.ti: ffff8802769ec000
[18485.380064] RIP: 0010:[<ffffffff81477389>] [<ffffffff81477389>] cpuidle_enter_state+0x129/0x260
[18485.380086] RSP: 0018:ffff8802769efe90 EFLAGS: 00000246
[18485.380094] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000008
[18485.380102] RDX: 0000000000000000 RSI: 00000000222ebf9d RDI: 0000000000000000
[18485.380110] RBP: ffff8802769efed0 R08: 00000001001ba5e2 R09: ffff88027fd11cf0
[18485.380118] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88027fd18938
[18485.380127] R13: 0000000000000001 R14: ffffffff81a65220 R15: ffffffff81a651c0
[18485.380136] FS: 0000000000000000(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
[18485.380148] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[18485.380156] CR2: 000000000070d000 CR3: 00000002584a0000 CR4: 00000000001006e0
[18485.380164] Stack:
[18485.380171] 000010c0b3f1de91 0000000000000000 ffffffff81a65238 ffffffff81aa9278
[18485.380190] ffff88027fd18938 ffff8802769ec000 ffff8802769ec000 ffffffff81a651c0
[18485.380210] ffff8802769efee0 ffffffff814774f7 ffff8802769eff30 ffffffff810a057c
[18485.380227] Call Trace:
[18485.380238] [<ffffffff814774f7>] cpuidle_enter+0x17/0x20
[18485.380250] [<ffffffff810a057c>] cpu_startup_entry+0x25c/0x2b0
[18485.380261] [<ffffffff8102e87e>] start_secondary+0xfe/0x120
[18485.380270] Code: c3 0f 1f 44 00 00 31 ff e8 a5 8e c2 ff 8b 45 cc 85 c0 74 12 9c 58 f6 c4 02 0f 85 00 01 00 00 31 ff e8 1c c5 c4 ff fb 48 2b 5d c0 <48> ba cf f7 53 e3 a5 9b c4 20 48 89 d8 48 c1 fb 3f 48 f7 ea b8
[18485.380672] Kernel panic - not syncing: Hard LOCKUP
[18485.380681] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G U 4.4.0-rc6+ #363
[18485.380693] Hardware name: /NUC5CPYB, BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
[18485.380704] ffff88027fd05be0 ffff88027fd05b50 ffffffff8126805f ffffffff818cbb74
[18485.380722] ffff88027fd05bd0 ffffffff811071b5 0000000000000008 ffff88027fd05be0
[18485.380739] ffff88027fd05b78 0000000258473402 0000000000000046 0000000000000007
[18485.380757] Call Trace:
[18485.380764] <NMI> [<ffffffff8126805f>] dump_stack+0x44/0x55
[18485.380780] [<ffffffff811071b5>] panic+0xbe/0x1df
[18485.380791] [<ffffffff810d5096>] watchdog_overflow_callback+0xc6/0xd0
[18485.380802] [<ffffffff81102fac>] __perf_event_overflow+0x8c/0x1d0
[18485.380812] [<ffffffff81103914>] perf_event_overflow+0x14/0x20
[18485.380822] [<ffffffff8101ae76>] intel_pmu_handle_irq+0x1c6/0x3e0
[18485.380833] [<ffffffff810303ef>] ? native_apic_wait_icr_idle+0x1f/0x30
[18485.380843] [<ffffffff810137a8>] perf_event_nmi_handler+0x28/0x50
[18485.380854] [<ffffffff81006ad7>] nmi_handle+0x67/0x100
[18485.380863] [<ffffffff81006ee7>] do_nmi+0x117/0x340
[18485.380873] [<ffffffff81524107>] end_repeat_nmi+0x1a/0x1e
[18485.380884] [<ffffffff81351ac2>] ? valleyview_pipestat_irq_handler+0x12/0x1c0
[18485.380897] [<ffffffff81351ac2>] ? valleyview_pipestat_irq_handler+0x12/0x1c0
[18485.380909] [<ffffffff81351ac2>] ? valleyview_pipestat_irq_handler+0x12/0x1c0
[18485.380919] <<EOE>> <IRQ> [<ffffffff81355ea6>] cherryview_irq_handler+0x66/0x120
[18485.380939] [<ffffffff810a8045>] handle_irq_event_percpu+0x75/0x170
[18485.380949] [<ffffffff810a816c>] handle_irq_event+0x2c/0x50
[18485.380959] [<ffffffff810ab38a>] handle_edge_irq+0x6a/0x150
[18485.381047] [<ffffffff8100597a>] handle_irq+0x1a/0x30
[18485.381057] [<ffffffff815245db>] do_IRQ+0x4b/0xd0
[18485.381067] [<ffffffff81522c7c>] common_interrupt+0x7c/0x7c
[18485.381075] <EOI> [<ffffffff81477389>] ? cpuidle_enter_state+0x129/0x260
[18485.381089] [<ffffffff814774f7>] cpuidle_enter+0x17/0x20
[18485.381099] [<ffffffff810a057c>] cpu_startup_entry+0x25c/0x2b0
[18485.381108] [<ffffffff8102e87e>] start_secondary+0xfe/0x120
[18485.381123] Kernel Offset: disabled
[18485.381135] Rebooting in 2 seconds..
[18487.381295] ACPI MEMORY or I/O RESET_REG.

Chris, is there a way to reproduce this? I assume that the patch attached is a proposal to get this fixed?

Bug scrub: Hi Rami, could you try with the patch proposed by Chris?

"./intel-gpu-tools/benchmarks/gem_latency -t -1 -p 16 -n 16" is enough to cause the machine to lock up in about 2-3 minutes. As are gem_cs_prefetch and several other existing igt tests.

commit 3a7325e4989352c7a3150628e0f941573dffbf62
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 8 11:43:31 2016 +0000

    igt/gem_exec_nop: Add a basic test

    Demonstrate how trivial it is to lockup Braswell, at least my N3050 nuc,
    by saturating the interrupt handler with a few requests.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=93467
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

adds a basic test that reliably reproduces the issue on my Braswell.
A simpler fix is:

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 53e5104..30d8bb7 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -1829,7 +1829,7 @@ static irqreturn_t cherryview_irq_handler(int irq, void *arg)
 	/* IRQs are synced during runtime_suspend, we don't require a wakeref */
 	disable_rpm_wakeref_asserts(dev_priv);
 
-	for (;;) {
+	do {
 		master_ctl = I915_READ(GEN8_MASTER_IRQ) & ~GEN8_MASTER_IRQ_CONTROL;
 		iir = I915_READ(VLV_IIR);
@@ -1857,7 +1857,7 @@ static irqreturn_t cherryview_irq_handler(int irq, void *arg)
 		I915_WRITE(GEN8_MASTER_IRQ, DE_MASTER_IRQ_CONTROL);
 		POSTING_READ(GEN8_MASTER_IRQ);
-	}
+	} while (0);
 
 	enable_rpm_wakeref_asserts(dev_priv);

A contributing factor here is that ACPI on this system does not declare an HPET. Ouch.

I can force-enable the HPET by hardcoding its address, and that is sufficient to prevent the lockups. Conversely, passing clocksource=tsc should then be enough to reproduce the NMI lockup on any system.

Actually, clocksource=tsc is not enough; it also requires hpet=disable.

I can reproduce a similar hang on BDW i7-5xxxU:

[ 181.115096] gem_exec_nop: starting subtest basic
[ 224.975059] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:0]
[ 224.983398] Modules linked in: asix usbnet hid_generic libphy usbhid mii coretemp lpc_ich i915 i2c_algo_bit drm_kms_helper cfbfillrect i2c_hid syscopyarea cfbimgblt hid sysfillrect sysimgblt fb_sys_fops cfbcopyarea video gpio_lynxpoint drm acpi_pad nls_iso8859_1 e1000e ahci ptp libahci pps_core
[ 225.013417] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G U 4.5.0-rc5-160225+ #174
[ 225.022818] Hardware name: Intel Corporation Broadwell Client platform/WhiteTip Mountain 1, BIOS BDW-E1R1.86C.0080.R01.1406120446 06/12/2014
[ 225.037168] task: ffffffff81a0f4c0 ti: ffffffff81a00000 task.ti: ffffffff81a00000
[ 225.045690] RIP: 0010:[<ffffffff81053690>] [<ffffffff81053690>] __do_softirq+0x70/0x2a0
[ 225.054915] RSP: 0018:ffff88014f403f30 EFLAGS: 00000206
[ 225.060963] RAX: 00000000ffffffff RBX: 0000000000000000 RCX: 0000000000200000
[ 225.069097] RDX: 0000000000000072 RSI: ffff88014f40c600 RDI: 00000000740909e0
[ 225.077240] RBP: ffff88014f403f80 R08: 0000002e972ffc40 R09: 0000000000000403
[ 225.085359] R10: 0000000000000004 R11: 0000000000000005 R12: 0000000000000008
[ 225.093471] R13: 0000000000000000 R14: ffffffff81a7dbe0 R15: 0000002e5b867c93
[ 225.101587] FS: 0000000000000000(0000) GS:ffff88014f400000(0000) knlGS:0000000000000000
[ 225.110797] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 225.117336] CR2: 00007f0eb4c959d8 CR3: 0000000002a0a000 CR4: 00000000003406f0
[ 225.125469] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 225.133585] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 225.141730] Stack:
[ 225.144018] ffffffff81a04000 0000000a00200000 00000000fffe7655 0000002e972ffc40
[ 225.152488] ffff880100000082 0000000000000000 0000000000000008 0000000000000000
[ 225.160980] ffffffff81a7dbe0 0000002e5b867c93 ffff88014f403f90 ffffffff81053ab6
[ 225.169485] Call Trace:
[ 225.172278] <IRQ>
[ 225.174480] [<ffffffff81053ab6>] irq_exit+0x86/0x90
[ 225.180333] [<ffffffff8103857d>] smp_apic_timer_interrupt+0x3d/0x50
[ 225.187579] [<ffffffff8156e96f>] apic_timer_interrupt+0x7f/0x90
[ 225.194431] <EOI>
[ 225.196629] [<ffffffff814a50e3>] ? cpuidle_enter_state+0xd3/0x2e0
[ 225.203875] [<ffffffff814a5312>] cpuidle_enter+0x12/0x20
[ 225.210049] [<ffffffff81084a21>] cpu_startup_entry+0x2f1/0x340
[ 225.216794] [<ffffffff815686df>] rest_init+0x7f/0x90
[ 225.222538] [<ffffffff81afb023>] start_kernel+0x470/0x47d
[ 225.228765] [<ffffffff81afa9a1>] ? set_init_arg+0x55/0x55
[ 225.235042] [<ffffffff81afa5ad>] x86_64_start_reservations+0x2a/0x2c
[ 225.242410] [<ffffffff81afa699>] x86_64_start_kernel+0xea/0xed
[ 225.249155] Code: 89 4d b8 89 45 d0 65 48 8b 04 25 c4 23 01 00 c7 45 bc 0a 00 00 00 48 89 45 b0 65 c7 05 b6 ec fb 7e 00 00 00 00 fb b8 ff ff ff ff <49> c7 c4 c0 90 a0 81 0f bc 45 d0 83 c0 01 89 45 d4 74 7d 48 63

I went back to a nightly kernel from the 25th of February just to eliminate possible recent regressions. The downside is that BDW has no potential endless loop in the irq handler, so it must be something else.

On my BDW I need a kernel with no debugging whatsoever to trigger this reliably: no tracing, no lockdep, even basic spinlock debugging needs to be turned off. In that setup gem_exec_nop/basic generates ~340k interrupts per second and creates numerous 10-20 second system-wide stalls.

I've tried to measure the durations of various sections of the code in intel_lrc.c, and although the averages and maximums are bad (if my numbers are correct, we can spend 10-25% of elapsed test time with interrupts off), I can't find anything that would block for 10-20 seconds in one go. And by inspecting the code I also can't figure out how that would happen.

Also, these lockups in general seem to come and go in batches. Sometimes the system is happily chugging along at 340k irq/s, and sometimes it is stalling all the time. What makes it latch into one of these modes I have no idea. At one point I thought I saw a correlation with the retire worker, but then it went away.
It still feels like upgrading the trylock there to a real lock improves things (more short lockups vs. long ones), but I can't figure out why that would make sense.

Cherryview irq loop removed:

commit 579de73b048a0a4c66c25a033ac76a2836e0cf73
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Mar 14 09:01:57 2016 +0000

    drm/i915: Exit cherryview_irq_handler() after one pass

Now we just need a bottom half for execlists.

commit 27af5eea54d161e1636001a7f2b5efd3b7a3d1f9
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Mon Apr 4 12:11:56 2016 +0100

    drm/i915: Move execlists irq handler to a bottom half

So closed. Chris, feel free to reopen if you need.