Bug 105600 - [CI] igt@* - incomplete - Softdog on gvtdvm machines, pi_lock lockdep in boot.log
Summary: [CI] igt@* - incomplete - Softdog on gvtdvm machines, pi_lock lockdep in boot...
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: Other All
: high normal
Assignee: XiongZhang
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
: 105599 (view as bug list)
Depends on:
Blocks:
 
Reported: 2018-03-19 10:03 UTC by Marta Löfstedt
Modified: 2019-05-31 10:27 UTC (History)
3 users (show)

See Also:
i915 platform: BDW, SKL
i915 features: display/Other


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Marta Löfstedt 2018-03-19 10:03:39 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_2/fi-skl-gvtdvm/igt@kms_frontbuffer_tracking@fbcdrrs-1p-primscrn-cur-indfb-draw-mmap-cpu.html

running: igt/kms_frontbuffer_tracking/fbcdrrs-1p-primscrn-cur-indfb-draw-mmap-cpu

[00/97]  |                                                                       
owatch: TIMEOUT!
owatch: timeout for /dev/watchdog0 set to 10 (requested 10)
FATAL: command execution failed
java.io.EOFException

This is all in dmesg:
<5>[   26.332741] owatch: Using watchdog device /dev/watchdog0
<5>[   26.336059] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[   26.344558] owatch: timeout for /dev/watchdog0 set to 410 (requested 410)
<6>[   31.746834] perf: interrupt took too long (4097 > 4026), lowering kernel.perf_event_max_sample_rate to 48000
<7>[   38.036000] [IGT] kms_frontbuffer_tracking: executing
<6>[   41.810611] perf: interrupt took too long (5141 > 5121), lowering kernel.perf_event_max_sample_rate to 38000
<6>[   57.951869] perf: interrupt took too long (6455 > 6426), lowering kernel.perf_event_max_sample_rate to 30000
<6>[   84.774600] perf: interrupt took too long (8074 > 8068), lowering kernel.perf_event_max_sample_rate to 24000
<6>[  128.839162] perf: interrupt took too long (10096 > 10092), lowering kernel.perf_event_max_sample_rate to 19000
Comment 1 Marta Löfstedt 2018-03-19 10:04:47 UTC
This is the second Softdog incomplete on the first test on fi-skl-gvtdvm, see bug 105599
Comment 2 Marta Löfstedt 2018-03-20 06:26:41 UTC
Here is another frontbuffer_tracking Softdog that is also on the first test in the shard:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_4/fi-skl-gvtdvm/igt@kms_frontbuffer_tracking@psrdrrs-1p-primscrn-cur-indfb-move.html

running: igt/kms_frontbuffer_tracking/psrdrrs-1p-primscrn-cur-indfb-move

[00/97]  |                                                              
owatch: TIMEOUT!

This is all in dmesg:
<5>[   22.259150] owatch: Using watchdog device /dev/watchdog0
<5>[   22.260727] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[   22.262484] owatch: timeout for /dev/watchdog0 set to 410 (requested 410)
<6>[   22.274177] perf: interrupt took too long (2567 > 2500), lowering kernel.perf_event_max_sample_rate to 77000
<6>[   27.457786] perf: interrupt took too long (3292 > 3208), lowering kernel.perf_event_max_sample_rate to 60000
<7>[   28.407794] [IGT] kms_frontbuffer_tracking: executing
<6>[  195.917343] perf: interrupt took too long (4658 > 4115), lowering kernel.perf_event_max_sample_rate to 42000
Comment 3 Marta Löfstedt 2018-03-20 06:27:57 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_4/fi-skl-gvtdvm/igt@kms_frontbuffer_tracking@psr-2p-scndscrn-pri-indfb-draw-mmap-wc.html

running: igt/kms_frontbuffer_tracking/psr-2p-scndscrn-pri-indfb-draw-mmap-wc

[00/97]  |                                                                  
owatch: TIMEOUT!
owatch: timeout for /dev/watchdog0 set to 10 (requested 10)

dmesg:
<5>[   23.179763] owatch: Using watchdog device /dev/watchdog0
<5>[   23.181583] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[   23.183544] owatch: timeout for /dev/watchdog0 set to 410 (requested 410)
<6>[   23.635390] perf: interrupt took too long (2526 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
<6>[   28.698246] perf: interrupt took too long (3259 > 3157), lowering kernel.perf_event_max_sample_rate to 61000
<7>[   29.104037] [IGT] kms_frontbuffer_tracking: executing
Comment 4 Marta Löfstedt 2018-03-20 06:29:28 UTC
*** Bug 105599 has been marked as a duplicate of this bug. ***
Comment 5 Marta Löfstedt 2018-03-20 06:31:28 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_3/fi-skl-gvtdvm/igt@kms_frontbuffer_tracking@drrs-2p-scndscrn-indfb-pgflip-blt.html

running: igt/kms_frontbuffer_tracking/drrs-2p-scndscrn-indfb-pgflip-blt

[00/97]  |                                                             
owatch: TIMEOUT!
owatch: timeout for /dev/watchdog0 set to 10 (requested 10)

<5>[   23.878119] owatch: Using watchdog device /dev/watchdog0
<5>[   23.879865] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[   23.881902] owatch: timeout for /dev/watchdog0 set to 410 (requested 410)
<6>[   25.013459] perf: interrupt took too long (2544 > 2500), lowering kernel.perf_event_max_sample_rate to 78000
<7>[   29.755628] [IGT] kms_frontbuffer_tracking: executing
<6>[   37.658039] perf: interrupt took too long (3295 > 3180), lowering kernel.perf_event_max_sample_rate to 60000
Comment 6 Marta Löfstedt 2018-03-20 06:49:38 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_4/fi-bdw-gvtdvm/igt@kms_properties@plane-properties-atomic.html

run.log:
running: igt/kms_properties/plane-properties-atomic

[00/97]  |                                         
owatch: TIMEOUT!
owatch: timeout for /dev/watchdog0 set to 10 (requested 10)

Full dmesg:
<5>[   24.071216] owatch: Using watchdog device /dev/watchdog0
<5>[   24.073518] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[   24.076762] owatch: timeout for /dev/watchdog0 set to 410 (requested 410)
<6>[   25.127388] perf: interrupt took too long (3211 > 3197), lowering kernel.perf_event_max_sample_rate to 62000
<6>[   30.692676] perf: interrupt took too long (4124 > 4013), lowering kernel.perf_event_max_sample_rate to 48000
<7>[   31.118853] [IGT] kms_properties: executing

boot log:

<4>[   10.874799] ------------[ cut here ]------------
<3>[   10.875470] ODEBUG: free active (active state 0) object type: hrtimer hint: hrtimer_wakeup+0x0/0x20
<4>[   10.876334] WARNING: CPU: 0 PID: 196 at lib/debugobjects.c:291 debug_print_object+0x67/0x80
<4>[   10.877186] Modules linked in: i915(+) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel prime_numbers e1000(+) i2c_piix4
<4>[   10.878107] CPU: 0 PID: 196 Comm: systemd-udevd Not tainted 4.16.0-rc5-g639c78d2805b-drmtip_4+ #1
<4>[   10.878977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
<4>[   10.879886] RIP: 0010:debug_print_object+0x67/0x80
<4>[   10.880345] RSP: 0018:ffffa5c0c027fcd8 EFLAGS: 00010092
<4>[   10.880895] RAX: 0000000000000057 RBX: ffffa24c27a8dbf0 RCX: 0000000000000002
<4>[   10.881399] RDX: 0000000080000002 RSI: ffffffffa8076546 RDI: 00000000ffffffff
<4>[   10.881948] RBP: ffffffffa8249cc0 R08: 0000000000000000 R09: 0000000000000000
<4>[   10.882452] R10: ffffa5c0c027fc70 R11: ffffffffa70efb8b R12: ffffffffa80747d4
<4>[   10.883022] R13: ffffffffa97b9560 R14: 0000000000000001 R15: ffffffffa97b9568
<4>[   10.883528] FS:  00007f5f6abb8480(0000) GS:ffffa24c3fc00000(0000) knlGS:0000000000000000
<4>[   10.884392] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   10.884912] CR2: 00005572006db178 CR3: 00000000780c0006 CR4: 00000000003606f0
<4>[   10.885417] Call Trace:
<4>[   10.885895]  debug_check_no_obj_freed+0x1ef/0x230
<4>[   10.886364]  __vunmap+0x5b/0x110
<4>[   10.886797]  load_module+0x2347/0x2dc0
<4>[   10.887261]  ? show_coresize+0x20/0x20
<4>[   10.887701]  ? vfs_read+0x126/0x150
<4>[   10.888179]  ? SyS_finit_module+0xa5/0xe0
<4>[   10.888631]  SyS_finit_module+0xa5/0xe0
<4>[   10.889117]  do_syscall_64+0x65/0x1a0
<4>[   10.889557]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
<4>[   10.890065] RIP: 0033:0x7f5f6a4bc839
<4>[   10.890501] RSP: 002b:00007fff477482b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4>[   10.891346] RAX: ffffffffffffffda RBX: 00005640e00d8730 RCX: 00007f5f6a4bc839
<4>[   10.891851] RDX: 0000000000000000 RSI: 00007f5f6a19b0e5 RDI: 000000000000000f
<4>[   10.892402] RBP: 00007f5f6a19b0e5 R08: 0000000000000000 R09: 00007fff477483d0
<4>[   10.892907] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000000000
<4>[   10.893452] R13: 00005640e00cdd30 R14: 0000000000020000 R15: 00005640e00d8730
<4>[   10.893959] Code: c1 8b 43 10 83 c2 01 8b 4b 14 89 15 2c b5 28 02 4c 8b 45 00 4c 89 e6 48 c7 c7 30 83 09 a8 48 8b 14 c5 e0 3b e4 a7 e8 09 f3 c1 ff <0f> 0b 5b 83 05 2b 48 ee 00 01 5d 41 5c c3 83 05 20 48 ee 00 01 
<4>[   10.895420] WARNING: CPU: 0 PID: 196 at lib/debugobjects.c:291 debug_print_object+0x67/0x80
<4>[   10.896303] ---[ end trace 77cae12efc453097 ]---

<4>[   10.896773] ======================================================
<4>[   10.896773] WARNING: possible circular locking dependency detected
<4>[   10.896773] 4.16.0-rc5-g639c78d2805b-drmtip_4+ #1 Not tainted
<4>[   10.896774] ------------------------------------------------------
<4>[   10.896774] systemd-udevd/196 is trying to acquire lock:
<4>[   10.896774]  ((console_sem).lock){..-.}, at: [<00000000673d59fd>] down_trylock+0xa/0x30

<4>[   10.896776] but task is already holding lock:
<4>[   10.896776]  (&obj_hash[i].lock){-.-.}, at: [<000000002f519ca4>] debug_check_no_obj_freed+0xa4/0x230

<4>[   10.896777] which lock already depends on the new lock.


<4>[   10.896778] the existing dependency chain (in reverse order) is:

<4>[   10.896779] -> #3 (&obj_hash[i].lock){-.-.}:
<4>[   10.896780]        __debug_object_init+0x5f/0x430
<4>[   10.896780]        hrtimer_init+0x1b/0x150
<4>[   10.896780]        init_dl_task_timer+0x17/0x30
<4>[   10.896781]        __sched_fork.isra.16+0xa7/0x100
<4>[   10.896781]        init_idle+0x53/0x270
<4>[   10.896781]        sched_init+0x42f/0x4a5
<4>[   10.896782]        start_kernel+0x20e/0x3f0
<4>[   10.896782]        secondary_startup_64+0xa5/0xb0

<4>[   10.896782] -> #2 (&rq->lock){-.-.}:
<4>[   10.896783]        task_fork_fair+0x36/0x120
<4>[   10.896784]        sched_fork+0x12f/0x2a0
<4>[   10.896784]        copy_process.part.7+0x5e4/0x1d90
<4>[   10.896784]        _do_fork+0xc0/0x6b0
<4>[   10.896785]        kernel_thread+0x20/0x30
<4>[   10.896785]        rest_init+0x1d/0x220
<4>[   10.896785]        start_kernel+0x3e8/0x3f0
<4>[   10.896786]        secondary_startup_64+0xa5/0xb0

<4>[   10.896786] -> #1 (&p->pi_lock){-.-.}:
<4>[   10.896787]        try_to_wake_up+0x27/0x610
<4>[   10.896787]        up+0x3b/0x50
<4>[   10.896788]        __up_console_sem+0x2e/0x50
<4>[   10.896788]        console_unlock+0x31f/0x650
<4>[   10.896788]        con_write+0x43/0x60
<4>[   10.896789]        n_tty_write+0x1a7/0x450
<4>[   10.896789]        tty_write+0x1b6/0x2e0
<4>[   10.896789]        do_iter_write+0x140/0x180
<4>[   10.896790]        vfs_writev+0x84/0xf0
<4>[   10.896790]        do_writev+0x4c/0xd0
<4>[   10.896790]        do_syscall_64+0x65/0x1a0
<4>[   10.896790]        entry_SYSCALL_64_after_hwframe+0x42/0xb7

<4>[   10.896791] -> #0 ((console_sem).lock){..-.}:
<4>[   10.896792]        _raw_spin_lock_irqsave+0x33/0x50
<4>[   10.896792]        down_trylock+0xa/0x30
<4>[   10.896793]        __down_trylock_console_sem+0x20/0x80
<4>[   10.896793]        console_trylock+0xe/0x60
<4>[   10.896793]        vprintk_emit+0x24f/0x4f0
<4>[   10.896793]        printk+0x3e/0x46
<4>[   10.896794]        __warn_printk+0x37/0x70
<4>[   10.896794]        debug_print_object+0x67/0x80
<4>[   10.896794]        debug_check_no_obj_freed+0x1ef/0x230
<4>[   10.896795]        __vunmap+0x5b/0x110
<4>[   10.896795]        load_module+0x2347/0x2dc0
<4>[   10.896795]        SyS_finit_module+0xa5/0xe0
<4>[   10.896795]        do_syscall_64+0x65/0x1a0
<4>[   10.896796]        entry_SYSCALL_64_after_hwframe+0x42/0xb7

<4>[   10.896796] other info that might help us debug this:

<4>[   10.896797] Chain exists of:
<4>[   10.896797]   (console_sem).lock --> &rq->lock --> &obj_hash[i].lock

<4>[   10.896799]  Possible unsafe locking scenario:

<4>[   10.896799]        CPU0                    CPU1
<4>[   10.896800]        ----                    ----
<4>[   10.896800]   lock(&obj_hash[i].lock);
<4>[   10.896801]                                lock(&rq->lock);
<4>[   10.896801]                                lock(&obj_hash[i].lock);
<4>[   10.896802]   lock((console_sem).lock);

<4>[   10.896803]  *** DEADLOCK ***

<4>[   10.896803] 1 lock held by systemd-udevd/196:
<4>[   10.896804]  #0:  (&obj_hash[i].lock){-.-.}, at: [<000000002f519ca4>] debug_check_no_obj_freed+0xa4/0x230

<4>[   10.896805] stack backtrace:
<4>[   10.896805] CPU: 0 PID: 196 Comm: systemd-udevd Not tainted 4.16.0-rc5-g639c78d2805b-drmtip_4+ #1
<4>[   10.896806] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
<4>[   10.896806] Call Trace:
<4>[   10.896807]  dump_stack+0x5f/0x86
<4>[   10.896807]  print_circular_bug.isra.18+0x1d0/0x2c0
<4>[   10.896807]  __lock_acquire+0x14ae/0x1b60
<4>[   10.896807]  ? lock_acquire+0xaf/0x200
<4>[   10.896808]  lock_acquire+0xaf/0x200
<4>[   10.896808]  ? down_trylock+0xa/0x30
<4>[   10.896808]  ? vprintk_emit+0x24f/0x4f0
<4>[   10.896809]  _raw_spin_lock_irqsave+0x33/0x50
<4>[   10.896809]  ? down_trylock+0xa/0x30
<4>[   10.896809]  down_trylock+0xa/0x30
<4>[   10.896809]  __down_trylock_console_sem+0x20/0x80
<4>[   10.896810]  console_trylock+0xe/0x60
<4>[   10.896810]  vprintk_emit+0x24f/0x4f0
<4>[   10.896810]  printk+0x3e/0x46
<4>[   10.896811]  ? clock_was_set_work+0x20/0x20
<4>[   10.896811]  __warn_printk+0x37/0x70
<4>[   10.896811]  ? debug_check_no_obj_freed+0xa4/0x230
<4>[   10.896811]  ? clock_was_set_work+0x20/0x20
<4>[   10.896812]  debug_print_object+0x67/0x80
<4>[   10.896812]  debug_check_no_obj_freed+0x1ef/0x230
<4>[   10.896812]  __vunmap+0x5b/0x110
<4>[   10.896813]  load_module+0x2347/0x2dc0
<4>[   10.896813]  ? show_coresize+0x20/0x20
<4>[   10.896813]  ? vfs_read+0x126/0x150
<4>[   10.896813]  ? SyS_finit_module+0xa5/0xe0
<4>[   10.896814]  SyS_finit_module+0xa5/0xe0
<4>[   10.896814]  do_syscall_64+0x65/0x1a0
<4>[   10.896814]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
<4>[   10.896815] RIP: 0033:0x7f5f6a4bc839
<4>[   10.896815] RSP: 002b:00007fff477482b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4>[   10.896816] RAX: ffffffffffffffda RBX: 00005640e00d8730 RCX: 00007f5f6a4bc839
<4>[   10.896816] RDX: 0000000000000000 RSI: 00007f5f6a19b0e5 RDI: 000000000000000f
<4>[   10.896816] RBP: 00007f5f6a19b0e5 R08: 0000000000000000 R09: 00007fff477483d0
<4>[   10.896817] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000000000
<4>[   10.896817] R13: 00005640e00cdd30 R14: 0000000000020000 R15: 00005640e00d8730
<1>[   12.520047] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
<1>[   12.520325] IP: hrtimer_active+0x4/0x40
<6>[   12.520485] PGD 0 P4D 0 
<4>[   12.520613] Oops: 0000 [#1] PREEMPT SMP PTI
<4>[   12.520772] Modules linked in: i915(+) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel prime_numbers e1000(+) i2c_piix4
<4>[   12.521105] CPU: 0 PID: 196 Comm: systemd-udevd Tainted: G        W        4.16.0-rc5-g639c78d2805b-drmtip_4+ #1
<4>[   12.521391] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
<4>[   12.521699] RIP: 0010:hrtimer_active+0x4/0x40
<4>[   12.521860] RSP: 0018:ffffa5c0c027fca8 EFLAGS: 00010282
<4>[   12.522038] RAX: 0000000000000000 RBX: ffffa5c0c0343a18 RCX: 0000000000000001
<4>[   12.522276] RDX: 0000000000000003 RSI: 0000000000000003 RDI: ffffa5c0c0343a18
<4>[   12.522490] RBP: 0000000000000000 R08: ffffa5c0c0343a18 R09: 0000000000000000
<4>[   12.522705] R10: 0000000000000000 R11: ffffffffa70efb8b R12: ffffa5c0c548f000
<4>[   12.522920] R13: ffffffffa97b9560 R14: ffffffffa8249cc0 R15: ffffffffa97b9568
<4>[   12.523166] FS:  00007f5f6abb8480(0000) GS:ffffa24c3fc00000(0000) knlGS:0000000000000000
<4>[   12.523412] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   12.523601] CR2: 0000000000000010 CR3: 00000000780c0005 CR4: 00000000003606f0
<4>[   12.524112] Call Trace:
<4>[   12.524563]  hrtimer_try_to_cancel+0x14/0x1d0
<4>[   12.525016]  hrtimer_fixup_free+0x15/0x40
<4>[   12.525505]  debug_object_fixup+0xf/0x30
<4>[   12.525948]  debug_check_no_obj_freed+0x221/0x230
<4>[   12.526451]  __vunmap+0x5b/0x110
<4>[   12.526893]  load_module+0x2347/0x2dc0
<4>[   12.527399]  ? show_coresize+0x20/0x20
<4>[   12.527844]  ? vfs_read+0x126/0x150
<4>[   12.528285]  ? SyS_finit_module+0xa5/0xe0
<4>[   12.528760]  SyS_finit_module+0xa5/0xe0
<4>[   12.529204]  do_syscall_64+0x65/0x1a0
<4>[   12.529685]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
<4>[   12.530151] RIP: 0033:0x7f5f6a4bc839
<4>[   12.530628] RSP: 002b:00007fff477482b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4>[   12.531433] RAX: ffffffffffffffda RBX: 00005640e00d8730 RCX: 00007f5f6a4bc839
<4>[   12.531976] RDX: 0000000000000000 RSI: 00007f5f6a19b0e5 RDI: 000000000000000f
<4>[   12.532487] RBP: 00007f5f6a19b0e5 R08: 0000000000000000 R09: 00007fff477483d0
<4>[   12.533031] R10: 000000000000000f R11: 0000000000000246 R12: 0000000000000000
<4>[   12.533550] R13: 00005640e00cdd30 R14: 0000000000020000 R15: 00005640e00d8730
<4>[   12.534085] Code: ff ff ff 48 8b 57 70 eb bf 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 48 c7 47 28 10 b5 10 a7 48 89 77 40 c3 0f 1f 00 48 8b 47 30 <8b> 50 10 f6 c2 01 75 1d 80 7f 38 00 75 23 48 3b 78 38 74 1d 39 
<1>[   12.535484] RIP: hrtimer_active+0x4/0x40 RSP: ffffa5c0c027fca8
<4>[   12.536008] CR2: 0000000000000010
<4>[   12.536447] ---[ end trace 77cae12efc453098 ]---
<3>[   12.536945] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34
<3>[   12.537821] in_atomic(): 0, irqs_disabled(): 1, pid: 196, name: systemd-udevd
<4>[   12.538327] INFO: lockdep is turned off.
<4>[   12.538771] irq event stamp: 365170
<4>[   12.539247] hardirqs last  enabled at (365169): [<0000000025e2fc5e>] _raw_spin_unlock_irqrestore+0x4c/0x60
<4>[   12.540135] hardirqs last disabled at (365170): [<0000000037f8c343>] _raw_spin_lock_irqsave+0xd/0x50
<4>[   12.541021] softirqs last  enabled at (360940): [<0000000020eb4dde>] __do_softirq+0x3a1/0x4aa
<4>[   12.541847] softirqs last disabled at (360933): [<000000004c971658>] irq_exit+0xa4/0xb0
<4>[   12.542683] CPU: 0 PID: 196 Comm: systemd-udevd Tainted: G      D W        4.16.0-rc5-g639c78d2805b-drmtip_4+ #1
<4>[   12.543580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
<4>[   12.544511] Call Trace:
<4>[   12.544928]  dump_stack+0x5f/0x86
<4>[   12.545384]  ___might_sleep+0x1d9/0x240
<4>[   12.545826]  exit_signals+0x1b/0x2a0
<4>[   12.546304]  do_exit+0x93/0xcb0
<4>[   12.546734]  rewind_stack_do_exit+0x17/0x20
<6>[   13.042212] PCI Interrupt Link [LNKC] enabled at IRQ 10
Comment 7 Daniel Vetter 2018-03-20 07:31:13 UTC
I created a small debug patch which hopefully reproduces the lockdep splat every time:

https://patchwork.freedesktop.org/patch/211427/

I have no idea how to fix the lockdep splat, too deep in core code. Also no idea whether the lockdep splat indicates a real issue that might hang the test or not.
Comment 9 Marta Löfstedt 2018-03-20 10:37:46 UTC
(In reply to Daniel Vetter from comment #7)
> I created a small debug patch which hopefully reproduces the lockdep splat
> every time:
> 
> https://patchwork.freedesktop.org/patch/211427/
> 
> I have no idea how to fix the lockdep splat, too deep in core code. Also no
> idea whether the lockdep splat indicates a real issue that might hang the
> test or not.

Daniel, I only put the test with the lockdep that incompletes on this bug.
Comment 10 Marta Löfstedt 2018-03-21 09:35:34 UTC
First test lockdep in boot.log:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_5/fi-skl-gvtdvm/igt@perf_pmu@init-sema-vcs1.html
Comment 14 Daniel Vetter 2018-03-26 19:56:01 UTC
test patch in topic/core-for-CI, hopefully sheds some light on where this stuff is from:

commit 0162204f7aa1db746262eb2b68d294c25e5be36c (HEAD -> topic/core-for-CI, drm-intel/topic/core-for-CI)
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Mar 20 17:02:58 2018 +0100

    RFC: debugobjects: capture stack traces at _init() time
Comment 15 Marta Löfstedt 2018-04-03 07:10:30 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_8/fi-skl-gvtdvm/igt@gem_exec_schedule@reorder-wide-render.html

boot log has the lockdep:
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_8/fi-skl-gvtdvm/boot3.log

owatch on first test:
running: igt/gem_exec_schedule/reorder-wide-render

[00/99]  |                                        
owatch: TIMEOUT!
Comment 16 Marta Löfstedt 2018-04-03 07:18:10 UTC
It also happened here but links are dead on public servers I pulled data from internal to check
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_8/fi-skl-gvtdvm/igt@kms_vblank@pipe-d-query-forked.html
Comment 17 Daniel Vetter 2018-04-03 07:43:33 UTC
New hit, now with the ODEBUG backtraced patch:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_8/fi-skl-gvtdvm/igt@gem_exec_schedule@reorder-wide-render.html

Relevant bits of dmesg (without the lockdep splat, which is secondary):

<4>[   10.850889] WARNING: CPU: 0 PID: 194 at lib/debugobjects.c:301 debug_print_object+0xac/0xe0
<4>[   10.851698] ---[ end trace 229b22432d4ca3d7 ]---
<3>[   10.852134] ODEBUG: ODEBUG: debug object originally initialized at:
<4>[   10.852640]    futex_wait+0x69/0x220
<4>[   10.853059]    do_futex+0x482/0xb00
<4>[   10.853475]    SyS_futex+0x118/0x150
<4>[   10.853935]    do_syscall_64+0x6b/0x1b0
<4>[   10.854358]    entry_SYSCALL_64_after_hwframe+0x42/0xb7


Blowing up at

<4>[   10.854893]  debug_print_object+0xac/0xe0
<4>[   10.854893]  ? __schedule+0x2a6/0xbb0
<4>[   10.854894]  ? lock_acquire+0xa6/0x210
<4>[   10.854894]  ? debug_check_no_obj_freed+0xac/0x220
<4>[   10.854894]  debug_check_no_obj_freed+0x1e6/0x220
<4>[   10.854895]  __vunmap+0xab/0x100
<4>[   10.854895]  load_module+0x2388/0x2ac0
<4>[   10.854895]  ? SyS_finit_module+0xb8/0xd0
<4>[   10.854896]  SyS_finit_module+0xb8/0xd0
<4>[   10.854896]  do_syscall_64+0x6b/0x1b0
<4>[   10.854896]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

Followed shortly by an Oops:

<1>[   13.299232] BUG: unable to handle kernel NULL pointer dereference at 0000000000000006
<1>[   13.299238] IP: do_raw_spin_lock+0x1/0xb0
<6>[   13.299239] PGD 0 P4D 0 
<4>[   13.299241] Oops: 0000 [#1] PREEMPT SMP PTI
<4>[   13.299243] Modules linked in: i915(+) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel e1000(+) i2c_piix4 prime_numbers
<4>[   13.299251] CPU: 0 PID: 194 Comm: systemd-udevd Tainted: G        W        4.16.0-rc7-g0539b52e05cd-drmtip_8+ #1
<4>[   13.299264] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
<4>[   13.299266] RIP: 0010:do_raw_spin_lock+0x1/0xb0
<4>[   13.299267] RSP: 0018:ffffbbf980257c58 EFLAGS: 00010086
<4>[   13.299268] RAX: ffffa3833ac6d040 RBX: 0000000000000002 RCX: 0000000000000000
<4>[   13.299269] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
<4>[   13.299270] RBP: 0000000000000287 R08: 0000000000000001 R09: ffffffff9010c09f
<4>[   13.299271] R10: ffffbbf980257c00 R11: 0000000000000000 R12: ffffbbf980257ca8
<4>[   13.299271] R13: 0000000000000003 R14: ffffffff9273da68 R15: ffffffff91249cc0
<4>[   13.299275] FS:  00007fbecbb66480(0000) GS:ffffa3833fc00000(0000) knlGS:0000000000000000
<4>[   13.299276] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   13.299276] CR2: 0000000000000006 CR3: 0000000077bc8002 CR4: 00000000003606f0
<4>[   13.299277] Call Trace:
<4>[   13.299281]  _raw_spin_lock_irqsave+0x3b/0x50
<4>[   13.299283]  ? lock_hrtimer_base.isra.2+0x1f/0x40
<4>[   13.299285]  lock_hrtimer_base.isra.2+0x1f/0x40
<4>[   13.299286]  hrtimer_try_to_cancel+0x36/0x1d0
<4>[   13.299288]  hrtimer_fixup_free+0x15/0x40
<4>[   13.299290]  debug_object_fixup+0x12/0x30
<4>[   13.299292]  debug_check_no_obj_freed+0x212/0x220
<4>[   13.299295]  __vunmap+0xab/0x100
<4>[   13.299298]  load_module+0x2388/0x2ac0
<4>[   13.299303]  ? SyS_finit_module+0xb8/0xd0
<4>[   13.299305]  SyS_finit_module+0xb8/0xd0
<4>[   13.299309]  do_syscall_64+0x6b/0x1b0
<4>[   13.299312]  entry_SYSCALL_64_after_hwframe+0x42/0xb7

I hope I pasted the right bits togethere, the ODEBUG, OOps and lockdep splat are all a bit interleaved.
Comment 19 XiongZhang 2018-04-09 08:08:00 UTC
I couldn't reproduce this issue on my SKL machine which doesn't have eDP screen.
I compiled kernel with:
CONFIG_CC_STACKPROTECTOR_AUTO=y
CONFIG_DEBUG_STACK_USAGE=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_SCHED_STACK_END_CHECK=y

Please paste your qemu command which is used to boot a vm ?

From ci's dmesg, two graphic cards exist after vm boot up, one is emulated qxl or circus card, the other is IGD card, and the emulated qxl or circus card is primary graphic card. So the igt case may run against qxl or circus card, not IGD card.

You could use the following qemu command which make sure vm has IGD card only.
qemu-system-x86_64 -enable-kvm -smp 2 -m 6G -cpu host -nographic -vga none -hda "$IMAGE" -device vfio-pci,host=00:02.0,id=hostdev0,bus=pci.0,addr=0x2 -usb -usbdevice tablet -net nic,macaddr=00:16:3e:60:0a:50,addr=0x3 -net tap,script=/etc/qemu-ifup  -vnc :1
Comment 20 Tomi Sarvela 2018-04-09 09:31:18 UTC
This qemu command line is used to boot fi-skl-gvtdvm:

qemu-system-x86_64 -enable-kvm -vga cirrus -m 2048 -hda /home/testrunner/ubuntu-16.04.img -hdb /home/testrunner/linux_hd.img -device vfio-pci,host=00:02.0,id=hostdev0,bus=pci.0,addr=0x6 -usb -usbdevice tablet -net nic,macaddr=b8:ae:ed:7f:76:38 -net tap,script=/etc/qemu-ifup -serial stdio -cpu host -vnc :1 > guest.log 2>&1
Comment 27 Jani Saarinen 2018-04-18 17:36:07 UTC
Pinged gvtdvm folks, waiting for info.
Comment 28 Jani Saarinen 2018-04-20 19:06:38 UTC
Moved to gvt team.
Comment 29 XiongZhang 2018-04-23 02:20:05 UTC
Could you try to modify Qemu command:

qemu-system-x86_64 -enable-kvm -nographic -vga none -m 2048 -hda /home/testrunner/ubuntu-16.04.img -hdb /home/testrunner/linux_hd.img -device vfio-pci,host=00:02.0,id=hostdev0,bus=pci.0,addr=0x2 -usb -usbdevice tablet -net nic,macaddr=b8:ae:ed:7f:76:38,addr=0x3 -net tap,script=/etc/qemu-ifup -serial stdio -cpu host -vnc :1 > guest.log 2>&1
Comment 30 Jani Saarinen 2018-04-23 06:24:40 UTC
+ Tomi
Comment 31 XiongZhang 2018-04-23 06:45:46 UTC
qemu-system-x86_64 -enable-kvm -nographic -vga none -m 2048 -hda
/home/testrunner/ubuntu-16.04.img -hdb /home/testrunner/linux_hd.img -device
 vfio-pci,host=00:02.0,id=hostdev0,bus=pci.0,addr=0x2 -usb -usbdevice tablet
-net nic,macaddr=b8:ae:ed:7f:76:38,addr=0x3 -net tap,script=/etc/qemu-ifup
-cpu host -vnc :1 > guest.log 2>&1

(In reply to XiongZhang from comment #29)
> Could you try to modify Qemu command:
> 
> qemu-system-x86_64 -enable-kvm -nographic -vga none -m 2048 -hda
> /home/testrunner/ubuntu-16.04.img -hdb /home/testrunner/linux_hd.img -device
> vfio-pci,host=00:02.0,id=hostdev0,bus=pci.0,addr=0x2 -usb -usbdevice tablet
> -net nic,macaddr=b8:ae:ed:7f:76:38,addr=0x3 -net tap,script=/etc/qemu-ifup
> -serial stdio -cpu host -vnc :1 > guest.log 2>&1
Comment 32 Jani Saarinen 2018-04-27 14:48:47 UTC
Tomi, any changes done here on our CI?
Comment 33 Tomi Sarvela 2018-06-01 09:17:55 UTC
The following changes were made to fi-skl-gvtd and fi-bdw-gvtd qemu commandline:
OLD
NEW

-vga cirrus
-nographic -vga none

-device vfio-pci,host=00:02.0,id=hostdev0,bus=pci.0,addr=0x6
-device vfio-pci,host=00:02.0,id=hostdev0,bus=pci.0,addr=0x2

-net nic,macaddr=b8:ae:ed:7f:76:38
-net nic,macaddr=b8:ae:ed:7f:76:38,addr=0x3

-serial stdio
(removed)

The changes will be active from CI_DRM_4270 onwards.
Comment 34 Tomi Sarvela 2018-06-01 11:36:57 UTC
This change caused at least  two instances of BDW-GVTDVM to spam lines:
*ERROR* Fault errors on pipe A: 0x00000100
Comment 36 Tomi Sarvela 2018-06-12 07:31:26 UTC
BDW-GVTDVM and SKL-GVTDVM qemu command line invocation changed as it was, and incomplete issues seemed to disappear.
Comment 37 Martin Peres 2018-07-18 08:17:29 UTC
(In reply to Tomi Sarvela from comment #36)
> BDW-GVTDVM and SKL-GVTDVM qemu command line invocation changed as it was,
> and incomplete issues seemed to disappear.

Still hapenning 100% of the time when running igt@gem_exec_reuse@contexts:

 - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_87/fi-bdw-gvtdvm/igt@gem_exec_reuse@contexts.html
 - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_85/fi-skl-gvtdvm/igt@gem_exec_reuse@contexts.html

Can you please check it out?
Comment 38 Martin Peres 2018-07-18 08:20:55 UTC
On a positive note, the "*ERROR* Fault errors on pipe A: 0x00000100" has not been hit in over a month.
Comment 39 Chris Wilson 2018-07-18 08:34:20 UTC
(In reply to Martin Peres from comment #37)
> (In reply to Tomi Sarvela from comment #36)
> > BDW-GVTDVM and SKL-GVTDVM qemu command line invocation changed as it was,
> > and incomplete issues seemed to disappear.
> 
> Still hapenning 100% of the time when running igt@gem_exec_reuse@contexts:
> 
>  -
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_87/fi-bdw-gvtdvm/
> igt@gem_exec_reuse@contexts.html
>  -
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_85/fi-skl-gvtdvm/
> igt@gem_exec_reuse@contexts.html
> 
> Can you please check it out?

That's a regression common to all machines,

<3>[  126.722643] active_instance:969 GEM_BUG_ON(list_empty(&vma->last_active.link))
<4>[  126.722783] ------------[ cut here ]------------
<2>[  126.722785] kernel BUG at drivers/gpu/drm/i915/i915_vma.c:969!
<4>[  126.722806] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4>[  126.722813] CPU: 0 PID: 1536 Comm: gem_exec_reuse Tainted: G     U            4.18.0-rc4-g1a9b9e1bf951-drmtip_87+ #1
<4>[  126.722834] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
<4>[  126.722895] RIP: 0010:i915_vma_move_to_active+0x6f0/0x700 [i915]
<4>[  126.722902] Code: 6f 8d cd c6 48 8b 35 37 c0 1a 00 49 c7 c0 98 6b 53 c0 b9 c9 03 00 00 48 c7 c2 70 11 50 c0 48 c7 c7 4b 54 42 c0 e8 10 1e d4 c6 <0f> 0b 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 41 57 41 56 41 55 
<4>[  126.722961] RSP: 0018:ffffa867c079fa90 EFLAGS: 00010286
<4>[  126.722968] RAX: 000000000000000e RBX: ffff91a42f82d008 RCX: 0000000000000000
<4>[  126.722975] RDX: 0000000000000001 RSI: 0000000000000008 RDI: ffff91a43d023a78
<4>[  126.722982] RBP: ffff91a3cfa91fc0 R08: 0000000000033c72 R09: ffff91a43d02a000
<4>[  126.722989] R10: 0000000000000000 R11: ffff91a43d023a78 R12: ffff91a430d70680
<4>[  126.722996] R13: ffff91a43934c040 R14: ffff91a43934c158 R15: 000000000000004f
<4>[  126.723008] FS:  00007f77b23c4980(0000) GS:ffff91a43fc00000(0000) knlGS:0000000000000000
<4>[  126.723016] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  126.723023] CR2: 00007fab6f1ef470 CR3: 00000000759d6003 CR4: 00000000003606f0
<4>[  126.723030] Call Trace:
<4>[  126.723068]  i915_gem_do_execbuffer+0x1239/0x18a0 [i915]
<4>[  126.723080]  ? trace_hardirqs_on_thunk+0x1a/0x1c
<4>[  126.723089]  ? retint_kernel+0x2d/0x2d
<4>[  126.723123]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4>[  126.723160]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4>[  126.723194]  i915_gem_execbuffer2_ioctl+0x21b/0x3f0 [i915]
<4>[  126.723229]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4>[  126.723237]  drm_ioctl_kernel+0x7c/0xf0
<4>[  126.723244]  drm_ioctl+0x2e6/0x3a0
<4>[  126.723276]  ? i915_gem_execbuffer_ioctl+0x300/0x300 [i915]
<4>[  126.723285]  ? _raw_spin_unlock_irq+0x24/0x50
<4>[  126.723292]  ? trace_hardirqs_on_caller+0xe0/0x1b0
<4>[  126.723300]  do_vfs_ioctl+0xa0/0x6d0
<4>[  126.723307]  ? __schedule+0x36c/0xbf0
<4>[  126.723313]  ksys_ioctl+0x35/0x60
<4>[  126.723319]  __x64_sys_ioctl+0x11/0x20
<4>[  126.723325]  do_syscall_64+0x55/0x190
<4>[  126.723331]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4>[  126.723337] RIP: 0033:0x7f77b1c8d5d7
<4>[  126.723341] Code: b3 66 90 48 8b 05 b1 48 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 81 48 2d 00 f7 d8 64 89 01 48 
<4>[  126.723401] RSP: 002b:00007ffcdaaf8b08 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4>[  126.723410] RAX: ffffffffffffffda RBX: 00007ffcdaaf9c40 RCX: 00007f77b1c8d5d7
<4>[  126.723417] RDX: 00007ffcdaaf8bd0 RSI: 0000000040406469 RDI: 0000000000000003
<4>[  126.723424] RBP: 00007ffcdaaf8bd0 R08: 000000000000000d R09: 00007ffcdab3f080
<4>[  126.723431] R10: 00000000ffffffd2 R11: 0000000000000246 R12: 0000000040406469
<4>[  126.723438] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000000

and needs to be filed separately.
Comment 40 Chris Wilson 2018-07-19 11:33:54 UTC
(In reply to Chris Wilson from comment #39)
> (In reply to Martin Peres from comment #37)
> > (In reply to Tomi Sarvela from comment #36)
> > > BDW-GVTDVM and SKL-GVTDVM qemu command line invocation changed as it was,
> > > and incomplete issues seemed to disappear.
> > 
> > Still hapenning 100% of the time when running igt@gem_exec_reuse@contexts:
> > 
> >  -
> > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_87/fi-bdw-gvtdvm/
> > igt@gem_exec_reuse@contexts.html
> >  -
> > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_85/fi-skl-gvtdvm/
> > igt@gem_exec_reuse@contexts.html
> > 
> > Can you please check it out?
> 
> That's a regression common to all machines,

commit 46b1063f91e5680eff2a6b9fe432f6aae3bef54e (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Jul 19 08:22:06 2018 +0100

    drm/i915: Handle recursive shrinker for vma->last_active allocation
    
    If we call into the shrinker for direct relcaim inside kmalloc, it will
    retire the requests. If we retire the vma->last_active while processing a
    new i915_vma_move_to_active() we can upset the delicate bookkeeping
    required for the cache. After the possible invocation of the shrinker, we
    need to double check the vma->last_active is still valid.
    
    Fixes: 8b293eb53a7d ("drm/i915: Track the last-active inside the i915_vma")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=105600#c39
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180719072206.16015-1-chris@chris-wilson.co.uk
Comment 41 Lakshmi 2019-05-31 10:27:08 UTC
From the last 5 months, the reproduction rate of this failure is once in 5.25 drmtip runs. Last seen on drmtip 233. 
According to 10x rule, Closing this bug as WORKSFORME.
Comment 42 CI Bug Log 2019-05-31 10:27:38 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.