Bug 99706

Summary:	[BAT] [HSW] drv_hangman/error-state-basic failing frequently in CI
Product:	DRI	Reporter:	Jani Saarinen <jani.saarinen>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	CLOSED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	major
Priority:	highest	CC:	dorota.czaplejewicz, intel-gfx-bugs
Version:	DRI git
Hardware:	x86-64 (AMD64)
OS:	All
Whiteboard:
i915 platform:	HSW	i915 features:

Description Jani Saarinen 2017-02-07 16:31:05 UTC

On HSW 4770 a lot of failures seen lately caused by IGT/drv_hangman/error-state-basic 

See eg. patchwork runs 
https://intel-gfx-ci.01.org/CI/Patchwork_3652/
https://intel-gfx-ci.01.org/CI/Patchwork_3681/
https://intel-gfx-ci.01.org/CI/Patchwork_3695/
https://intel-gfx-ci.01.org/CI/Patchwork_3696/
https://intel-gfx-ci.01.org/CI/Patchwork_3708/
https://intel-gfx-ci.01.org/CI/Patchwork_3710/
https://intel-gfx-ci.01.org/CI/Patchwork_3713/
https://intel-gfx-ci.01.org/CI/Patchwork_3721/
https://intel-gfx-ci.01.org/CI/Patchwork_3722/

See dmesg from last run:
https://intel-gfx-ci.01.org/CI/Patchwork_3722/fi-hsw-4770/dmesg-during.log


Do we have any idea what caused issue?

Comment 1 Chris Wilson 2017-02-07 20:57:23 UTC

[   88.317089] INFO: rcu_preempt self-detected stall on CPU
[   88.317095]  6-...: (64998 ticks this GP) idle=9fd/140000000000002/0 softirq=5389/5389 fqs=15893
[   88.317098]   (t=65000 jiffies g=2237 c=2236 q=580)
[   88.317114] Task dump for CPU 6:
[   88.317114] drv_hangman     R  running task        0  6133   6105 0x00000008
[   88.317116] Call Trace:
[   88.317117]  <IRQ>
[   88.317121]  sched_show_task+0x15d/0x260
[   88.317122]  ? sched_show_task+0x86/0x260
[   88.317123]  dump_cpu_task+0x32/0x40
[   88.317140]  rcu_dump_cpu_stacks+0x99/0xbd
[   88.317142]  rcu_check_callbacks+0x93f/0xba0
[   88.317144]  ? __this_cpu_preempt_check+0x13/0x20
[   88.317146]  ? trace_hardirqs_off_caller+0x9a/0x130
[   88.317147]  ? trace_hardirqs_off+0xd/0x10
[   88.317149]  ? tick_sched_handle.isra.7+0x30/0x30
[   88.317150]  update_process_times+0x2a/0x50
[   88.317151]  tick_sched_handle.isra.7+0x29/0x30
[   88.317152]  tick_sched_timer+0x40/0x70
[   88.317153]  __hrtimer_run_queues+0x101/0x560
[   88.317154]  hrtimer_interrupt+0xb9/0x210
[   88.317155]  local_apic_timer_interrupt+0x31/0x50
[   88.317156]  smp_apic_timer_interrupt+0x33/0x50
[   88.317159]  apic_timer_interrupt+0x90/0xa0
[   88.317161] RIP: 0010:__do_softirq+0x98/0x4c0
[   88.317161] RSP: 0018:ffff88041eb83f70 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10
[   88.317162] RAX: 00000000ffffffff RBX: ffff880403d62740 RCX: 0000000000000000
[   88.317163] RDX: 0000000000000000 RSI: ffffffff81c97524 RDI: ffffffff81c73209
[   88.317163] RBP: ffff88041eb83fc0 R08: 0000000000000000 R09: 0000000000000000
[   88.317164] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[   88.317164] R13: ffffc90001aafc70 R14: 0000000000000001 R15: ffff8804065355c0
[   88.317165]  irq_exit+0xa9/0xc0
[   88.317166]  smp_apic_timer_interrupt+0x38/0x50
[   88.317167]  apic_timer_interrupt+0x90/0xa0
[   88.317168] RIP: 0010:_raw_spin_unlock_irq+0x2d/0x50
[   88.317169] RSP: 0018:ffffc90001aafbe0 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff10
[   88.317170] RAX: ffff880403d62740 RBX: ffff8803ff380058 RCX: 0000000000000001
[   88.317170] RDX: 0000000000000000 RSI: ffffffff81c97524 RDI: 0000000000000001
[   88.317171] RBP: ffffc90001aafbe8 R08: 0000000000000000 R09: 0000000000000000
[   88.317171] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
[   88.317171] R13: ffffc90001aafc70 R14: 0000000000000001 R15: ffff8804065355c0
[   88.317172]  </IRQ>
[   88.317194]  intel_engine_add_wait+0x31/0x40 [i915]
[   88.317208]  i915_wait_request+0x2e6/0x800 [i915]
[   88.317209]  ? wake_atomic_t_function+0x30/0x30
[   88.317220]  i915_gem_object_wait_fence+0x172/0x1a0 [i915]
[   88.317231]  i915_gem_object_wait+0x29a/0x380 [i915]
[   88.317242]  i915_gem_wait_ioctl+0x12c/0x290 [i915]
[   88.317253]  ? i915_gem_wait_ioctl+0x22/0x290 [i915]
[   88.317255]  drm_ioctl+0x200/0x450
[   88.317266]  ? i915_gem_close_object+0x120/0x120 [i915]
[   88.317267]  do_vfs_ioctl+0x90/0x6e0
[   88.317269]  ? entry_SYSCALL_64_fastpath+0x5/0xb1
[   88.317270]  ? __this_cpu_preempt_check+0x13/0x20
[   88.317271]  ? trace_hardirqs_on_caller+0xe7/0x200
[   88.317272]  SyS_ioctl+0x3c/0x70
[   88.317273]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[   88.317274] RIP: 0033:0x7f2be2bc8357
[   88.317274] RSP: 002b:00007fff9c213d78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   88.317275] RAX: ffffffffffffffda RBX: ffffffff81474e03 RCX: 00007f2be2bc8357
[   88.317276] RDX: 00007fff9c213db0 RSI: 00000000c010646c RDI: 0000000000000003
[   88.317276] RBP: ffffc90001aaff88 R08: 0000000000000008 R09: 0000000000000023
[   88.317276] R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000000
[   88.317277] R13: 0000000000000003 R14: 00000000c010646c R15: 0000000000000000
[   88.317278]  ? __this_cpu_preempt_check+0x13/0x20


is usually a missed irqsafe spin_lock.

Comment 2 Martin Peres 2017-02-25 00:44:44 UTC

Bisected to "47a8e3f6aebd4d6048572b3aeaef1dd51acb0d82 drm/i915: Eliminate superfluous i915_ggtt_view_normal".

This was done with 40 runs on each commit, which is not necessarily sufficient, so I will let the machine run during the week end to confirm the bisect.

I also need to confirm that the bug I am seeing is the same as what is seen on CI.

Comment 3 Chris Wilson 2017-02-25 01:33:38 UTC

It's safe to say it's not 47a8e3f6aeb.

Comment 4 Martin Peres 2017-02-25 11:25:09 UTC

(In reply to Chris Wilson from comment #3)
> It's safe to say it's not 47a8e3f6aeb.

OK, starting the bisect again with 200 runs between v4.9 and 4b60513d8a00fc960a2d837fcccf6a63c861d9ec(drm-tip: 2017y-02m-06d-22h-40m-55s UTC integration manifest).

We will likely not get results until well into next week, but whatever!

Comment 5 Chris Wilson 2017-02-27 09:50:01 UTC

Probably:

Greetings,

0day kernel testing robot got the below dmesg and the first bad commit is

git://anongit.freedesktop.org/drm-intel topic/core-for-CI

commit 585774774191cce113cd3ab1419499d8a7f7687e
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed Jan 11 17:43:02 2017 +0100
Commit:     Chris Wilson <chris@chris-wilson.co.uk>
CommitDate: Mon Feb 20 17:07:24 2017 +0000

    locking/mutex: Clear mutex-handoff flag on interrupt

    On Mon, Jan 09, 2017 at 11:52:03AM +0000, Chris Wilson wrote:
    > If we abort the mutex_lock() due to an interrupt, or other error from

    s/interrupt/signal/, right?

    > ww_mutex, we need to relinquish the handoff flag if we applied it.
    > Otherwise, we may cause missed wakeups as the current owner may try to
    > handoff to a new thread that is not expecting the handoff and so sleep
    > thinking the lock is already claimed (and since the owner unlocked there
    > may never be a new wakeup).

    Isn't that the exact same scenario as Nicolai fixed here:

      http://lkml.kernel.org/r/1482346000-9927-3-git-send-email-nhaehnle@gmail.com

    Did you, like Nicolai, find this by inspection, or can you reproduce?

    FWIW, I have the below patch that should also solve this problem afaict.

d8870ff73d  mm/vmalloc: Replace opencoded 4-level page walkers
5857747741  locking/mutex: Clear mutex-handoff flag on interrupt

Comment 6 Chris Wilson 2017-02-28 18:58:21 UTC

We've replaced my suspected commit from topic/core-for-CI with the real thing. Could you leave your test running in a loop on drm-tip and see it is still reproducible?

Comment 7 Dorota Czaplejewicz 2017-03-01 12:30:01 UTC

I tried this 238 times on drm-tip: 2017y-02m-27d-07h-11m-41s on HSW-4770R and no failures.

Comment 8 Chris Wilson 2017-03-01 17:44:42 UTC

Fingers crossed and hope it doesn't reoccur.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.