Summary: | [BAT] [HSW] drv_hangman/error-state-basic failing frequently in CI | ||
---|---|---|---|
Product: | DRI | Reporter: | Jani Saarinen <jani.saarinen> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | major | ||
Priority: | highest | CC: | dorota.czaplejewicz, intel-gfx-bugs |
Version: | DRI git | ||
Hardware: | x86-64 (AMD64) | ||
OS: | All | ||
Whiteboard: | |||
i915 platform: | HSW | i915 features: |
Description
Jani Saarinen
2017-02-07 16:31:05 UTC
[ 88.317089] INFO: rcu_preempt self-detected stall on CPU [ 88.317095] 6-...: (64998 ticks this GP) idle=9fd/140000000000002/0 softirq=5389/5389 fqs=15893 [ 88.317098] (t=65000 jiffies g=2237 c=2236 q=580) [ 88.317114] Task dump for CPU 6: [ 88.317114] drv_hangman R running task 0 6133 6105 0x00000008 [ 88.317116] Call Trace: [ 88.317117] <IRQ> [ 88.317121] sched_show_task+0x15d/0x260 [ 88.317122] ? sched_show_task+0x86/0x260 [ 88.317123] dump_cpu_task+0x32/0x40 [ 88.317140] rcu_dump_cpu_stacks+0x99/0xbd [ 88.317142] rcu_check_callbacks+0x93f/0xba0 [ 88.317144] ? __this_cpu_preempt_check+0x13/0x20 [ 88.317146] ? trace_hardirqs_off_caller+0x9a/0x130 [ 88.317147] ? trace_hardirqs_off+0xd/0x10 [ 88.317149] ? tick_sched_handle.isra.7+0x30/0x30 [ 88.317150] update_process_times+0x2a/0x50 [ 88.317151] tick_sched_handle.isra.7+0x29/0x30 [ 88.317152] tick_sched_timer+0x40/0x70 [ 88.317153] __hrtimer_run_queues+0x101/0x560 [ 88.317154] hrtimer_interrupt+0xb9/0x210 [ 88.317155] local_apic_timer_interrupt+0x31/0x50 [ 88.317156] smp_apic_timer_interrupt+0x33/0x50 [ 88.317159] apic_timer_interrupt+0x90/0xa0 [ 88.317161] RIP: 0010:__do_softirq+0x98/0x4c0 [ 88.317161] RSP: 0018:ffff88041eb83f70 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10 [ 88.317162] RAX: 00000000ffffffff RBX: ffff880403d62740 RCX: 0000000000000000 [ 88.317163] RDX: 0000000000000000 RSI: ffffffff81c97524 RDI: ffffffff81c73209 [ 88.317163] RBP: ffff88041eb83fc0 R08: 0000000000000000 R09: 0000000000000000 [ 88.317164] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001 [ 88.317164] R13: ffffc90001aafc70 R14: 0000000000000001 R15: ffff8804065355c0 [ 88.317165] irq_exit+0xa9/0xc0 [ 88.317166] smp_apic_timer_interrupt+0x38/0x50 [ 88.317167] apic_timer_interrupt+0x90/0xa0 [ 88.317168] RIP: 0010:_raw_spin_unlock_irq+0x2d/0x50 [ 88.317169] RSP: 0018:ffffc90001aafbe0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10 [ 88.317170] RAX: ffff880403d62740 RBX: ffff8803ff380058 RCX: 0000000000000001 [ 88.317170] RDX: 0000000000000000 RSI: ffffffff81c97524 RDI: 0000000000000001 [ 88.317171] RBP: ffffc90001aafbe8 R08: 0000000000000000 R09: 0000000000000000 [ 88.317171] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001 [ 88.317171] R13: ffffc90001aafc70 R14: 0000000000000001 R15: ffff8804065355c0 [ 88.317172] </IRQ> [ 88.317194] intel_engine_add_wait+0x31/0x40 [i915] [ 88.317208] i915_wait_request+0x2e6/0x800 [i915] [ 88.317209] ? wake_atomic_t_function+0x30/0x30 [ 88.317220] i915_gem_object_wait_fence+0x172/0x1a0 [i915] [ 88.317231] i915_gem_object_wait+0x29a/0x380 [i915] [ 88.317242] i915_gem_wait_ioctl+0x12c/0x290 [i915] [ 88.317253] ? i915_gem_wait_ioctl+0x22/0x290 [i915] [ 88.317255] drm_ioctl+0x200/0x450 [ 88.317266] ? i915_gem_close_object+0x120/0x120 [i915] [ 88.317267] do_vfs_ioctl+0x90/0x6e0 [ 88.317269] ? entry_SYSCALL_64_fastpath+0x5/0xb1 [ 88.317270] ? __this_cpu_preempt_check+0x13/0x20 [ 88.317271] ? trace_hardirqs_on_caller+0xe7/0x200 [ 88.317272] SyS_ioctl+0x3c/0x70 [ 88.317273] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 88.317274] RIP: 0033:0x7f2be2bc8357 [ 88.317274] RSP: 002b:00007fff9c213d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 88.317275] RAX: ffffffffffffffda RBX: ffffffff81474e03 RCX: 00007f2be2bc8357 [ 88.317276] RDX: 00007fff9c213db0 RSI: 00000000c010646c RDI: 0000000000000003 [ 88.317276] RBP: ffffc90001aaff88 R08: 0000000000000008 R09: 0000000000000023 [ 88.317276] R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000000 [ 88.317277] R13: 0000000000000003 R14: 00000000c010646c R15: 0000000000000000 [ 88.317278] ? __this_cpu_preempt_check+0x13/0x20 is usually a missed irqsafe spin_lock. Bisected to "47a8e3f6aebd4d6048572b3aeaef1dd51acb0d82 drm/i915: Eliminate superfluous i915_ggtt_view_normal". This was done with 40 runs on each commit, which is not necessarily sufficient, so I will let the machine run during the week end to confirm the bisect. I also need to confirm that the bug I am seeing is the same as what is seen on CI. It's safe to say it's not 47a8e3f6aeb. (In reply to Chris Wilson from comment #3) > It's safe to say it's not 47a8e3f6aeb. OK, starting the bisect again with 200 runs between v4.9 and 4b60513d8a00fc960a2d837fcccf6a63c861d9ec(drm-tip: 2017y-02m-06d-22h-40m-55s UTC integration manifest). We will likely not get results until well into next week, but whatever! Probably: Greetings, 0day kernel testing robot got the below dmesg and the first bad commit is git://anongit.freedesktop.org/drm-intel topic/core-for-CI commit 585774774191cce113cd3ab1419499d8a7f7687e Author: Peter Zijlstra <peterz@infradead.org> AuthorDate: Wed Jan 11 17:43:02 2017 +0100 Commit: Chris Wilson <chris@chris-wilson.co.uk> CommitDate: Mon Feb 20 17:07:24 2017 +0000 locking/mutex: Clear mutex-handoff flag on interrupt On Mon, Jan 09, 2017 at 11:52:03AM +0000, Chris Wilson wrote: > If we abort the mutex_lock() due to an interrupt, or other error from s/interrupt/signal/, right? > ww_mutex, we need to relinquish the handoff flag if we applied it. > Otherwise, we may cause missed wakeups as the current owner may try to > handoff to a new thread that is not expecting the handoff and so sleep > thinking the lock is already claimed (and since the owner unlocked there > may never be a new wakeup). Isn't that the exact same scenario as Nicolai fixed here: http://lkml.kernel.org/r/1482346000-9927-3-git-send-email-nhaehnle@gmail.com Did you, like Nicolai, find this by inspection, or can you reproduce? FWIW, I have the below patch that should also solve this problem afaict. d8870ff73d mm/vmalloc: Replace opencoded 4-level page walkers 5857747741 locking/mutex: Clear mutex-handoff flag on interrupt We've replaced my suspected commit from topic/core-for-CI with the real thing. Could you leave your test running in a loop on drm-tip and see it is still reproducible? I tried this 238 times on drm-tip: 2017y-02m-27d-07h-11m-41s on HSW-4770R and no failures. Fingers crossed and hope it doesn't reoccur. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.