Bug 103927 - [CI][BXT only] igt@* - incomplete timeout/system hang
Summary: [CI][BXT only] igt@* - incomplete timeout/system hang
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: Other All
: high normal
Assignee: Stanislav Lisovskiy
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks: 105984
  Show dependency treegraph
 
Reported: 2017-11-27 08:20 UTC by Marta Löfstedt
Modified: 2018-11-08 10:07 UTC (History)
1 user (show)

See Also:
i915 platform: BXT
i915 features: display/Other, GPU hang


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Marta Löfstedt 2017-11-27 08:20:02 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3385/shard-apl3/igt@gem_exec_reloc@basic-wc-gtt-active.html

dmesg:
<5>[  703.500041] owatch: Using watchdog device /dev/watchdog0
<5>[  703.500147] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[  703.501677] owatch: timeout for /dev/watchdog0 set to 370 (requested 370)
...
<4>[  953.947982] hpet1: lost 7161 rtc interrupts
<4>[  954.837466] hpet1: lost 7161 rtc interrupts
<4>[  955.728351] hpet1: lost 7160 rtc interrupts

run.log:
running: igt/gem_exec_reloc/basic-wc-gtt-active

[48/75] skip: 16, pass: 31, fail: 1 |          
FATAL: command execution failed
java.io.EOFException
...
Finished: FAILURE
Completed CI_IGT_test CI_DRM_3385/shard-apl3/20 : FAILURE
CI_IGT_test runtime 328 seconds
Rebooting shard-apl3
Comment 1 Marta Löfstedt 2017-11-27 08:21:42 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3388/shard-apl2/igt@drv_selftest@mock_breadcrumbs.html

dmesg:
<5>[ 1702.026113] owatch: Using watchdog device /dev/watchdog0
<5>[ 1702.029167] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[ 1702.067375] owatch: timeout for /dev/watchdog0 set to 370 (requested 370)
...
<4>[ 1830.939223] hpet1: lost 7160 rtc interrupts
<4>[ 1831.824734] hpet1: lost 7161 rtc interrupts
<4>[ 1832.710064] hpet1: lost 7161 rtc interrupts

run.log:
running: igt/drv_selftest/mock_breadcrumbs

[05/43] pass: 4, dmesg-warn: 1 /          
FATAL: command execution failed
java.io.EOFException
...
Finished: FAILURE
Completed CI_IGT_test CI_DRM_3388/shard-apl2/35 : FAILURE
CI_IGT_test runtime 190 seconds
Rebooting shard-apl2
Comment 2 Marta Löfstedt 2017-11-30 13:34:43 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3416/shard-apl6/igt@gem_tiled_swapping@non-threaded.html


dmesg:
<5>[ 1048.871613] owatch: Using watchdog device /dev/watchdog0
<5>[ 1048.871720] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[ 1048.873275] owatch: timeout for /dev/watchdog0 set to 370 (requested 370)
...
<7>[ 1203.843465] [drm:drm_dp_read_desc] DP branch: OUI 00-00-00 dev-ID  HW-rev 0.0 SW-rev 0.0 quirks 0x0000
<7>[ 1203.843520] [drm:intel_dp_detect [i915]] Sink is not MST capable
<7>[ 1203.848943] [drm:drm_helper_hpd_irq_event] [CONNECTOR:72:DP-2] status updated from connected to connected
<7>[ 1203.991750] [IGT] gem_tiled_swapping: executing

no indication of timeout or softdog in run.log, i.e. looks like system hang.
Comment 3 Marta Löfstedt 2017-12-04 07:31:37 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3434/shard-apl5/igt@kms_flip@2x-flip-vs-rmfb-interruptible.html

dmesg:
<5>[ 1128.849581] owatch: Using watchdog device /dev/watchdog0
<5>[ 1128.849994] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[ 1128.851845] owatch: timeout for /dev/watchdog0 set to 370 (requested 370)
<4>[ 1129.140345] hpet1: lost 7161 rtc interrupts
...
<4>[ 1465.370844] hpet1: lost 7161 rtc interrupts
<4>[ 1466.262777] hpet1: lost 7161 rtc interrupts
<4>[ 1467.154938] hpet1: lost 7161 rtc interrupts
Folloed by "stray"

run.log:
running: igt/kms_flip/2x-flip-vs-rmfb-interruptible

[20/75] skip: 9, pass: 11 |                        
FATAL: command execution failed
...
Finished: FAILURE
Completed CI_IGT_test CI_DRM_3434/shard-apl5/28 : FAILURE
CI_IGT_test runtime 388 seconds
Rebooting shard-apl5
Comment 5 Marta Löfstedt 2018-01-02 07:50:42 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3574/shard-apl1/igt@gem_cs_tlb@render.html

run.log:
running: igt/gem_exec_flush/basic-uc-set-default

[46/76] skip: 22, pass: 24 -                    
FATAL: command execution failed
...
Completed CI_IGT_test CI_DRM_3574/shard-apl1/26 : FAILURE
CI_IGT_test runtime 547 seconds
Rebooting shard-apl1

dmesg:
<4>[  330.029664] Setting dangerous option reset - tainting kernel
<7>[  330.032933] [IGT] gem_exec_flush: starting subtest basic-uc-set-default
<7>[  335.431877] [IGT] gem_exec_flush: exiting, ret=0
<7>[  335.627075] [IGT] gem_cs_tlb: executing
<7>[  335.644193] [IGT] gem_cs_tlb: starting subtest render
Comment 6 Marta Löfstedt 2018-01-12 12:28:11 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3624/shard-apl6/igt@drv_selftest@live_hugepages.html

run.log:
running: igt/drv_selftest/live_gtt

[20/43] pass: 19, dmesg-warn: 1 | 
FATAL: command execution failed
...
Completed CI_IGT_test CI_DRM_3624/shard-apl6/35 : FAILURE
CI_IGT_test runtime 461 seconds
Rebooting shard-apl6

there is pstore:
<0>[  431.286708] drv_self-5805    1..s1 253153398us : execlists_submission_tasklet: vecs0 in[0]:  ctx=2.1, seqno=1
...
<0>[  431.414112]   <idle>-0       0.Ns1 361042309us : execlists_submission_tasklet: vcs0 out[0]: ctx=3.1, seqno=1358
<0>[  431.414242] ---------------------------------
<0>[  431.414306] Kernel Offset: disabled

dmesg:
<7>[  257.267285] [IGT] drv_selftest: starting subtest live_gtt
...
<7>[  341.784308] [IGT] drv_selftest: starting subtest live_hugepages
...
<6>[  341.982461] failed to allocate THP, finishing test early
<5>[  342.986261] igt_write_huge timed out on engine=1, offset_low=22bd0000 offset_high=ffffdd230000, max_page_size=10000
Comment 7 Marta Löfstedt 2018-01-18 07:34:36 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4148/shard-apl8/igt@perf_pmu@busy-double-start-vecs0.html

there is pstore, but it is hard to interpret since it is dominated by ftrace.
Comment 8 Marta Löfstedt 2018-01-24 09:00:23 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4172/shard-apl5/igt@kms_cursor_crc@cursor-256x256-suspend.html

<1>[    5.150957] BUG: unable to handle kernel NULL pointer dereference at           (null)
<1>[    5.150971] IP: hda_widget_sysfs_reinit+0x13c/0x200 [snd_hda_core]
<6>[    5.150973] PGD 0 P4D 0 
<4>[    5.150977] Oops: 0000 [#1] PREEMPT SMP PTI
<4>[    5.150979] Modules linked in: snd_hda_codec_hdmi(+) snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 lpc_ich mii snd_pcm mei_me mei prime_numbers pinctrl_broxton pinctrl_intel
<4>[    5.151001] CPU: 3 PID: 253 Comm: systemd-udevd Not tainted 4.15.0-rc9-CI-CI_DRM_3673+ #1
<4>[    5.151002] Hardware name:                  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
<4>[    5.151008] RIP: 0010:hda_widget_sysfs_reinit+0x13c/0x200 [snd_hda_core]
<4>[    5.151009] RSP: 0018:ffffc900003e3b80 EFLAGS: 00010246
<4>[    5.151011] RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffff88025ff69b58
<4>[    5.151012] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: ffffffff82246460
<4>[    5.151014] RBP: ffff880265b84548 R08: 0000000000000001 R09: ffff88026e3fa198
<4>[    5.151015] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
<4>[    5.151016] R13: 0000000000000038 R14: ffffffffa00a67c0 R15: ffff88026a193e38
<4>[    5.151017] FS:  00007fbaf82658c0(0000) GS:ffff88027fd80000(0000) knlGS:0000000000000000
<4>[    5.151018] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[    5.151020] CR2: 0000000000000000 CR3: 000000027326c000 CR4: 00000000003406e0
<4>[    5.151021] Call Trace:
<4>[    5.151030]  snd_hdac_refresh_widgets+0x65/0x90 [snd_hda_core]
<4>[    5.151040]  snd_hda_codec_update_widgets+0xe/0x50 [snd_hda_codec]
<4>[    5.151047]  intel_hsw_common_init.part.20+0x2b/0x90 [snd_hda_codec_hdmi]
<4>[    5.151053]  hda_codec_driver_probe+0x5e/0x100 [snd_hda_codec]
<4>[    5.151059]  driver_probe_device+0x2a3/0x480
<4>[    5.151063]  __driver_attach+0xd9/0xe0
<4>[    5.151066]  ? driver_probe_device+0x480/0x480
<4>[    5.151067]  bus_for_each_dev+0x57/0x90
<4>[    5.151070]  bus_add_driver+0x168/0x260
<4>[    5.151074]  driver_register+0x52/0xc0
<4>[    5.151075]  ? 0xffffffffa03a1000
<4>[    5.151078]  do_one_initcall+0x39/0x150
<4>[    5.151082]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[    5.151085]  ? kmem_cache_alloc_trace+0x248/0x2b0
<4>[    5.151088]  do_init_module+0x56/0x1ef
<4>[    5.151091]  load_module+0x231c/0x2d70
<4>[    5.151096]  ? show_coresize+0x20/0x20
<4>[    5.151098]  ? vfs_read+0x126/0x150
<4>[    5.151106]  ? SyS_finit_module+0xa5/0xe0
<4>[    5.151108]  SyS_finit_module+0xa5/0xe0
<4>[    5.151115]  do_syscall_64+0x59/0x1a0
<4>[    5.151119]  entry_SYSCALL64_slow_path+0x25/0x25
<4>[    5.151121] RIP: 0033:0x7fbaf70ee9f9
<4>[    5.151122] RSP: 002b:00007ffefb0c8e58 EFLAGS: 00000202 ORIG_RAX: 0000000000000139
<4>[    5.151124] RAX: ffffffffffffffda RBX: 00005591f5ae89f0 RCX: 00007fbaf70ee9f9
<4>[    5.151125] RDX: 0000000000000000 RSI: 00007fbaf7a11e23 RDI: 0000000000000007
<4>[    5.151126] RBP: 00007fbaf7a11e23 R08: 0000000000000000 R09: 0000000000000000
<4>[    5.151127] R10: 0000000000000007 R11: 0000000000000202 R12: 0000000000000000
<4>[    5.151128] R13: 00005591f5aac150 R14: 0000000000020000 R15: 00005591f4a6fe30
<4>[    5.151135] Code: e4 49 c7 c6 c0 67 0a a0 83 e8 01 4c 8d 2c c5 08 00 00 00 eb 2b 48 8b 8d a8 04 00 00 0f b7 d3 83 c3 01 29 c2 48 63 c2 48 8b 51 10 <48> 8b 14 c2 49 8b 47 10 4a 89 14 20 49 83 c4 08 4d 39 e5 74 36 
<1>[    5.151183] RIP: hda_widget_sysfs_reinit+0x13c/0x200 [snd_hda_core] RSP: ffffc900003e3b80
<4>[    5.151184] CR2: 0000000000000000
<4>[    5.151187] ---[ end trace 8765903cf6e4535d ]---
Comment 9 Marta Löfstedt 2018-01-25 12:47:08 UTC
This is a Meta bug to capture all unexplained incompletes on APL-shards
Comment 10 Marta Löfstedt 2018-01-31 14:42:00 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3706/shard-apl3/igt@perf_pmu@most-busy-check-all-vecs0.html

pstore only has ftrace info. 

Note CI_DRM_3704 - CI_DRM_3706 looks strange, lots of not run tests.
Comment 11 Marta Löfstedt 2018-02-05 07:25:31 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3718/shard-apl4/igt@perf_pmu@busy-double-start-vcs0.html

This has pstore which unfortunately only has ftrace information.
Comment 13 Marta Löfstedt 2018-03-22 06:59:47 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3964/fi-bxt-dsi/igt@kms_pipe_crc_basic@suspend-read-crc-pipe-a.html

running: igt/kms_pipe_crc_basic/suspend-read-crc-pipe-a

[240/285] skip: 26, pass: 214 |                        
FATAL: command execution failed
...
Completed CI_IGT_test CI_DRM_3964/fi-bxt-dsi/0 : FAILURE
CI_IGT_test runtime 562 seconds
Rebooting fi-bxt-dsi

Last dmesg:
<7>[  450.080918] [IGT] kms_pipe_crc_basic: starting subtest suspend-read-crc-pipe-A
<7>[  450.081382] [drm:drm_mode_addfb2] [FB:107]
<7>[  450.109061] [drm:drm_mode_setcrtc] [CRTC:46:pipe A]
<7>[  450.109110] [drm:drm_mode_setcrtc] [CONNECTOR:98:DSI-1]
<7>[  450.127575] [drm:drm_mode_setcrtc] [CRTC:65:pipe B]
<7>[  450.129837] [drm:drm_mode_setcrtc] [CRTC:79:pipe C]
<7>[  450.226438] [drm:drm_mode_addfb2] [FB:106]
<7>[  450.244482] [drm:drm_mode_setcrtc] [CRTC:46:pipe A]
<7>[  450.244553] [drm:drm_mode_setcrtc] [CONNECTOR:98:DSI-1]
Comment 14 Marta Löfstedt 2018-03-26 06:52:49 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3974/fi-bxt-dsi/igt@kms_flip@basic-flip-vs-dpms.html

run.log:
running: igt/kms_flip/basic-flip-vs-dpms

[215/285] skip: 22, pass: 193 \         
FATAL: command execution failed
...
Completed CI_IGT_test CI_DRM_3974/fi-bxt-dsi/0 : FAILURE
CI_IGT_test runtime 461 seconds
Rebooting fi-bxt-dsi

Last dmesg:
<7>[  354.175870] [IGT] kms_flip: executing
...
<7>[  354.214763] [drm:drm_dp_i2c_do_msg] native defer
<7>[  354.215624] [drm:lspcon_wait_mode [i915]] Current LSPCON mode P
Comment 15 Stanislav Lisovskiy 2018-07-03 07:28:40 UTC
So far, I couldn't reproduce any of the issues reported here, using my BXT.

I took all the tests mentioned and run them around 100 times. So those are either some specific hardware/configuration issues or they are fixed.
Comment 16 Stanislav Lisovskiy 2018-07-06 08:35:38 UTC
There seems to be some problem with drv_selftests: it occassionally runs out of memory(memory leak?), which causes then all subsequent tests to fail:

<3>[  170.754357] Out of memory: Kill process 5670 (drv_selftest) score 999 or sacrifice child
<7>[  170.754702] [drm:vma_print_allocator.constprop.12 [i915]] vma.node [ffffe000 + 00001000] is pinned: inserted at  i915_gem_gtt_insert+0x135/0x510 [i915]
                   __i915_vma_do_pin+0x462/0xd90 [i915]
                   logical_render_ring_init+0x33e/0x470 [i915]
                   intel_engines_init+0x4f/0x120 [i915]
                   i915_gem_init+0x20b/0x6a0 [i915]
                   i915_driver_load+0xc3a/0x1070 [i915]
                   i915_pci_probe+0x29/0x90 [i915]
                   pci_device_probe+0xa1/0x130
                   driver_probe_device+0x306/0x480
                   __driver_attach+0xdb/0x100
                   bus_for_each_dev+0x74/0xc0
                   bus_add_driver+0x15f/0x250
                   driver_register+0x56/0xe0
                   do_one_initcall+0x58/0x370
                   do_init_module+0x56/0x1ea
                   load_module+0x24

<3>[  170.763513] Out of memory: Kill process 782 (java) score 12 or sacrifice child
<7>[  170.763902] [drm:vma_print_allocator.constprop.12 [i915]] vma.node [ffffe000 + 00001000] is pinned: inserted at  i915_gem_gtt_insert+0x135/0x510 [i915]
                   __i915_vma_do_pin+0x462/0xd90 [i915]
                   logical_render_ring_init+0x33e/0x470 [i915]
                   intel_engines_init+0x4f/0x120 [i915]
                   i915_gem_init+0x20b/0x6a0 [i915]
                   i915_driver_load+0xc3a/0x1070 [i915]
                   i915_pci_probe+0x29/0x90 [i915]
                   pci_device_probe+0xa1/0x130
                   driver_probe_device+0x306/0x480
                   __driver_attach+0xdb/0x100
                   bus_for_each_dev+0x74/0xc0
                   bus_add_driver+0x15f/0x250
                   driver_register+0x56/0xe0
                   do_one_initcall+0x58/0x370
                   do_init_module+0x56/0x1ea
                   load_module+0x24

And so on. I guess this could be one of the reasons for getting lots of incomplete tests.
Comment 17 Francesco Balestrieri 2018-11-08 09:46:15 UTC
> There seems to be some problem with drv_selftests: it occassionally runs out > of memory(memory leak?), which causes then all subsequent tests to fail

Was this ever followed-up? It sounds like it could be major.
Comment 18 Stanislav Lisovskiy 2018-11-08 10:07:38 UTC
This seems to be kind of generic bug, summarizing all test case incompletes, not just some specific failure, so I guess we will always have something here.

According to http://gfx-ci.fi.intel.com/cibuglog-ng/issue/138/history all the recent issues happen to ApolloLake. To me it seems that it either runs out of memory or there is a memory leak in self_tests. I will check it, since didn't have time to do that yet unfortunately.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.