On HSW 4770, a lot of failures have been seen lately in IGT/drv_hangman/error-state-basic. See e.g. these patchwork runs:

https://intel-gfx-ci.01.org/CI/Patchwork_3652/
https://intel-gfx-ci.01.org/CI/Patchwork_3681/
https://intel-gfx-ci.01.org/CI/Patchwork_3695/
https://intel-gfx-ci.01.org/CI/Patchwork_3696/
https://intel-gfx-ci.01.org/CI/Patchwork_3708/
https://intel-gfx-ci.01.org/CI/Patchwork_3710/
https://intel-gfx-ci.01.org/CI/Patchwork_3713/
https://intel-gfx-ci.01.org/CI/Patchwork_3721/
https://intel-gfx-ci.01.org/CI/Patchwork_3722/

See dmesg from the last run: https://intel-gfx-ci.01.org/CI/Patchwork_3722/fi-hsw-4770/dmesg-during.log

Do we have any idea what caused this issue?
[   88.317089] INFO: rcu_preempt self-detected stall on CPU
[   88.317095]  6-...: (64998 ticks this GP) idle=9fd/140000000000002/0 softirq=5389/5389 fqs=15893
[   88.317098]   (t=65000 jiffies g=2237 c=2236 q=580)
[   88.317114] Task dump for CPU 6:
[   88.317114] drv_hangman     R  running task        0  6133   6105 0x00000008
[   88.317116] Call Trace:
[   88.317117]  <IRQ>
[   88.317121]  sched_show_task+0x15d/0x260
[   88.317122]  ? sched_show_task+0x86/0x260
[   88.317123]  dump_cpu_task+0x32/0x40
[   88.317140]  rcu_dump_cpu_stacks+0x99/0xbd
[   88.317142]  rcu_check_callbacks+0x93f/0xba0
[   88.317144]  ? __this_cpu_preempt_check+0x13/0x20
[   88.317146]  ? trace_hardirqs_off_caller+0x9a/0x130
[   88.317147]  ? trace_hardirqs_off+0xd/0x10
[   88.317149]  ? tick_sched_handle.isra.7+0x30/0x30
[   88.317150]  update_process_times+0x2a/0x50
[   88.317151]  tick_sched_handle.isra.7+0x29/0x30
[   88.317152]  tick_sched_timer+0x40/0x70
[   88.317153]  __hrtimer_run_queues+0x101/0x560
[   88.317154]  hrtimer_interrupt+0xb9/0x210
[   88.317155]  local_apic_timer_interrupt+0x31/0x50
[   88.317156]  smp_apic_timer_interrupt+0x33/0x50
[   88.317159]  apic_timer_interrupt+0x90/0xa0
[   88.317161] RIP: 0010:__do_softirq+0x98/0x4c0
[   88.317161] RSP: 0018:ffff88041eb83f70 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10
[   88.317162] RAX: 00000000ffffffff RBX: ffff880403d62740 RCX: 0000000000000000
[   88.317163] RDX: 0000000000000000 RSI: ffffffff81c97524 RDI: ffffffff81c73209
[   88.317163] RBP: ffff88041eb83fc0 R08: 0000000000000000 R09: 0000000000000000
[   88.317164] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[   88.317164] R13: ffffc90001aafc70 R14: 0000000000000001 R15: ffff8804065355c0
[   88.317165]  irq_exit+0xa9/0xc0
[   88.317166]  smp_apic_timer_interrupt+0x38/0x50
[   88.317167]  apic_timer_interrupt+0x90/0xa0
[   88.317168] RIP: 0010:_raw_spin_unlock_irq+0x2d/0x50
[   88.317169] RSP: 0018:ffffc90001aafbe0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10
[   88.317170] RAX: ffff880403d62740 RBX: ffff8803ff380058 RCX: 0000000000000001
[   88.317170] RDX: 0000000000000000 RSI: ffffffff81c97524 RDI: 0000000000000001
[   88.317171] RBP: ffffc90001aafbe8 R08: 0000000000000000 R09: 0000000000000000
[   88.317171] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[   88.317171] R13: ffffc90001aafc70 R14: 0000000000000001 R15: ffff8804065355c0
[   88.317172]  </IRQ>
[   88.317194]  intel_engine_add_wait+0x31/0x40 [i915]
[   88.317208]  i915_wait_request+0x2e6/0x800 [i915]
[   88.317209]  ? wake_atomic_t_function+0x30/0x30
[   88.317220]  i915_gem_object_wait_fence+0x172/0x1a0 [i915]
[   88.317231]  i915_gem_object_wait+0x29a/0x380 [i915]
[   88.317242]  i915_gem_wait_ioctl+0x12c/0x290 [i915]
[   88.317253]  ? i915_gem_wait_ioctl+0x22/0x290 [i915]
[   88.317255]  drm_ioctl+0x200/0x450
[   88.317266]  ? i915_gem_close_object+0x120/0x120 [i915]
[   88.317267]  do_vfs_ioctl+0x90/0x6e0
[   88.317269]  ? entry_SYSCALL_64_fastpath+0x5/0xb1
[   88.317270]  ? __this_cpu_preempt_check+0x13/0x20
[   88.317271]  ? trace_hardirqs_on_caller+0xe7/0x200
[   88.317272]  SyS_ioctl+0x3c/0x70
[   88.317273]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[   88.317274] RIP: 0033:0x7f2be2bc8357
[   88.317274] RSP: 002b:00007fff9c213d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   88.317275] RAX: ffffffffffffffda RBX: ffffffff81474e03 RCX: 00007f2be2bc8357
[   88.317276] RDX: 00007fff9c213db0 RSI: 00000000c010646c RDI: 0000000000000003
[   88.317276] RBP: ffffc90001aaff88 R08: 0000000000000008 R09: 0000000000000023
[   88.317276] R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000000
[   88.317277] R13: 0000000000000003 R14: 00000000c010646c R15: 0000000000000000
[   88.317278]  ? __this_cpu_preempt_check+0x13/0x20

The "? __this_cpu_preempt_check+0x13/0x20" is usually a missed irqsafe spin_lock.
Bisected to 47a8e3f6aebd4d6048572b3aeaef1dd51acb0d82 ("drm/i915: Eliminate superfluous i915_ggtt_view_normal"). This was done with 40 runs on each commit, which is not necessarily sufficient, so I will let the machine run over the weekend to confirm the bisect. I also need to confirm that the bug I am seeing is the same as the one seen on CI.
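As a side note on why 40 runs per commit may not be sufficient: for a flaky failure with per-run probability p, the chance that a bad commit passes all n runs is (1-p)^n. A small sketch of that calculation (the 5% miss-probability threshold and the example rates are my assumptions, not numbers from this report):

```python
import math

def runs_needed(fail_rate, miss_prob=0.05):
    """Runs required so a bad commit slips through undetected (all runs
    pass) with probability at most `miss_prob`: solve (1 - p)^n <= miss_prob."""
    return math.ceil(math.log(miss_prob) / math.log(1.0 - fail_rate))

print(runs_needed(0.05))  # -> 59: runs needed to catch a 5%-per-run failure
print(runs_needed(0.10))  # -> 29: runs needed to catch a 10%-per-run failure
```

So 40 runs per step only reliably distinguishes commits when the failure reproduces in more than roughly 7% of runs.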
It's safe to say it's not 47a8e3f6aeb.
(In reply to Chris Wilson from comment #3)
> It's safe to say it's not 47a8e3f6aeb.

OK, starting the bisect again with 200 runs between v4.9 and 4b60513d8a00fc960a2d837fcccf6a63c861d9ec (drm-tip: 2017y-02m-06d-22h-40m-55s UTC integration manifest). We will likely not get results until well into next week, but whatever!
Probably:

Greetings,

0day kernel testing robot got the below dmesg and the first bad commit is

git://anongit.freedesktop.org/drm-intel topic/core-for-CI

commit 585774774191cce113cd3ab1419499d8a7f7687e
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed Jan 11 17:43:02 2017 +0100
Commit:     Chris Wilson <chris@chris-wilson.co.uk>
CommitDate: Mon Feb 20 17:07:24 2017 +0000

    locking/mutex: Clear mutex-handoff flag on interrupt

    On Mon, Jan 09, 2017 at 11:52:03AM +0000, Chris Wilson wrote:
    > If we abort the mutex_lock() due to an interrupt, or other error from

    s/interrupt/signal/, right?

    > ww_mutex, we need to relinquish the handoff flag if we applied it.
    > Otherwise, we may cause missed wakeups as the current owner may try to
    > handoff to a new thread that is not expecting the handoff and so sleep
    > thinking the lock is already claimed (and since the owner unlocked there
    > may never be a new wakeup).

    Isn't that the exact same scenario as Nicolai fixed here:

    http://lkml.kernel.org/r/1482346000-9927-3-git-send-email-nhaehnle@gmail.com

    Did you, like Nicolai, find this by inspection, or can you reproduce?

    FWIW, I have the below patch that should also solve this problem afaict.

d8870ff73d  mm/vmalloc: Replace opencoded 4-level page walkers
5857747741  locking/mutex: Clear mutex-handoff flag on interrupt
We've replaced my suspected commit from topic/core-for-CI with the real thing. Could you leave your test running in a loop on drm-tip and see if it is still reproducible?
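A minimal sketch of such a loop, assuming the test binary returns a non-zero exit code on failure (the invocation in the comment is illustrative; the exact IGT path and flags may differ):

```python
import subprocess

def run_in_loop(cmd, max_runs):
    """Run `cmd` repeatedly; return the 1-based index of the first failing
    run (non-zero exit code), or None if all `max_runs` runs passed."""
    for i in range(1, max_runs + 1):
        if subprocess.run(cmd).returncode != 0:
            return i
    return None

# Illustrative invocation (subtest name from this report, path assumed):
# run_in_loop(["./tests/drv_hangman", "--run-subtest", "error-state-basic"], 200)
```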
I tried this 238 times on drm-tip: 2017y-02m-27d-07h-11m-41s on HSW-4770R, with no failures.
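As a rough sanity check (my calculation, not a figure from this report), 238 consecutive passes bound how flaky the test could still be: if the true per-run failure rate were p, the chance of 238 clean runs is (1-p)^238, so at 95% confidence p is capped at about 1.25%.

```python
def max_fail_rate(clean_runs, confidence=0.95):
    """Largest per-run failure rate still consistent, at the given
    confidence level, with observing `clean_runs` consecutive passes:
    solve (1 - p)^clean_runs = 1 - confidence for p."""
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_runs)

print(round(max_fail_rate(238) * 100, 2))  # -> 1.25 (percent)
```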
Fingers crossed and hope it doesn't reoccur.