Summary: | [CI][BAT] igt* incomplete - /dev/kmsg buffer overrun, some messages lost - pstore cleanout broken | ||||||
---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Marta Löfstedt <marta.lofstedt> | ||||
Component: | DRM/Intel | Assignee: | Tomi Sarvela <tomi.p.sarvela> | ||||
Status: | CLOSED WORKSFORME | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||
Severity: | critical | ||||||
Priority: | high | CC: | intel-gfx-bugs | ||||
Version: | DRI git | ||||||
Hardware: | Other | ||||||
OS: | All | ||||||
Whiteboard: | ReadyForDev | ||||||
i915 platform: | ALL | i915 features: | |||||
Attachments: |
|
Description
Marta Löfstedt
2017-11-13 10:53:06 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7090/fi-bxt-dsi/igt@vgem_basic@dmabuf-export.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3340/shard-apl2/igt@drv_selftest@live_hangcheck.html https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7097/fi-bsw-n3050/igt@vgem_basic@dmabuf-export.html from pstore note: <44>[ 528.342933] systemd-journald[179]: /dev/kmsg buffer overrun, some messages lost. i.e. we can't trust the information. https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7090/fi-bxt-dsi/igt@vgem_basic@dmabuf-export.html from dmesg, there is no pstore available. ftrace "spamming" starts here: <7>[ 457.728243] [IGT] vgem_basic: executing <0>[ 471.342503] kms_psr_-4050 1..s1 444620858us : intel_lrc_irq_handler: bcs0 out[0]: ctx=3.1, seqno=28d1b then we get into a chain of BUG_ON backtraces: <3>[ 472.794818] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34 <1>[ 472.806539] BUG: unable to handle kernel paging request at fffffffefaacf48a <3>[ 472.811672] BUG: scheduling while atomic: kworker/1:15/3965/0x00000003 <4>[ 472.812493] WARNING: CPU: 1 PID: 3965 at kernel/rcu/tree_plugin.h:329 rcu_note_context_switch+0x70/0x6c0 <3>[ 537.840247] INFO: rcu_preempt detected stalls on CPUs/tasks: <3>[ 537.840299] Tasks blocked on level-0 rcu_node (CPUs 0-3): P3965 <3>[ 537.840342] (detected by 0, t=65002 jiffies, g=28545, c=28544, q=29901) https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_492/fi-bxt-j4205/igt@vgem_basic@dmabuf-export.html dmesg: <7>[ 473.265673] [IGT] vgem_basic: starting subtest debugfs <4>[ 473.266084] INFO: trying to register non-static key. <4>[ 473.266097] the code is fine but needs lockdep annotation. <4>[ 473.266103] turning off the locking correctness validator. then ftrace spamming: <0>[ 473.266614] Dumping ftrace buffer: <0>[ 473.266661] --------------------------------- <0>[ 473.266848] CPU:1 [LOST 153991 EVENTS] <...>-3366 1..s1 325544570us : intel_lrc_irq_handler: vecs0 in[0]: ctx=2.1, seqno=1adce ... <0>[ 486.328080] <idle>-0 2..s1 472520536us : intel_lrc_irq_handler: vecs0 out[0]: ctx=3.1, seqno=23667 the the backtraces: <4>[ 486.328666] Call Trace: <4>[ 486.328683] ? drm_mode_gamma_get_ioctl+0xf0/0xf0 <3>[ 486.617585] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34 <5>[ 486.623297] systemd-journald[195]: /dev/kmsg buffer overrun, some messages lost. <1>[ 486.627944] BUG: unable to handle kernel paging request at fffffffeecf32030 <3>[ 486.633071] BUG: scheduling while atomic: kworker/2:2/205/0x00000003 <4>[ 486.633091] INFO: lockdep is turned off. <4>[ 486.633310] WARNING: CPU: 2 PID: 205 at kernel/rcu/tree_plugin.h:329 rcu_note_context_switch+0x70/0x6c0 <7>[ 487.001473] [IGT] vgem_basic: executing <7>[ 487.030825] [IGT] vgem_basic: starting subtest dmabuf-export <3>[ 551.597068] INFO: rcu_preempt detected stalls on CPUs/tasks: <3>[ 551.597121] Tasks blocked on level-0 rcu_node (CPUs 0-3): P205 <3>[ 551.597164] (detected by 2, t=65002 jiffies, g=32449, c=32448, q=28149) Note pstore only has the first stuff: https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_492/fi-bxt-j4205/dmesg-1510488286_Oops_1.log <0>[ 486.328080] <idle>-0 2..s1 472520536us : intel_lrc_irq_handler: vecs0 out[0]: ctx=3.1, seqno=23667 <0>[ 486.328117] --------------------------------- <4>[ 486.328138] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep lpc_ich snd_hda_core snd_pcm r8169 mii mei_me prime_numbers mei i2c_hid pinctrl_broxton pinctrl_intel <4>[ 486.328309] CPU: 2 PID: 4341 Comm: vgem_basic Tainted: G U 4.14.0-rc8-CI-CI_DRM_3335+ #1 <4>[ 486.328343] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.10 09/29/2016 <4>[ 486.328379] task: ffff88026b45a780 task.stack: ffffc9000112c000 <4>[ 486.328408] RIP: 0010:drm_framebuffer_info+0x77/0xc0 <4>[ 486.328428] RSP: 0018:ffffc9000112fd68 EFLAGS: 00010286 <4>[ 486.328453] RAX: 0000000000000000 RBX: fffffffffffffff8 RCX: 0000000000000001 <4>[ 486.328479] RDX: 0000000080000001 RSI: 0000000000000000 RDI: 00000000ffffffff <4>[ 486.328506] RBP: ffffc9000112fda0 R08: 0000000000000000 R09: 0000000000000000 <4>[ 486.328532] R10: ffffc9000112fd58 R11: ffffffff81607ef8 R12: ffff8801fd432950 <4>[ 486.328559] R13: ffff8801fd4328c8 R14: ffffffff81ce969b R15: ffff88026674e2b8 <4>[ 486.328586] FS: 00007fa4579e08c0(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000 <4>[ 486.328617] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[ 486.328640] CR2: 0000000000000010 CR3: 00000002649e3000 CR4: 00000000003406e0 <4>[ 486.328666] Call Trace: <4>[ 486.328683] ? drm_mode_gamma_get_ioctl+0xf0/0xf0 <4>[ 486.328709] seq_read+0xe1/0x3a0 <4>[ 486.328732] full_proxy_read+0x54/0x80 <4>[ 486.328753] __vfs_read+0x28/0x120 <4>[ 486.328772] ? putname+0x54/0x60 <4>[ 486.328791] ? rcu_read_lock_sched_held+0x7a/0x90 <4>[ 486.328814] ? entry_SYSCALL_64_fastpath+0x5/0xb1 <4>[ 486.328836] vfs_read+0xa8/0x150 <4>[ 486.328855] SyS_read+0x49/0xb0 <4>[ 486.328873] entry_SYSCALL_64_fastpath+0x1c/0xb1 <4>[ 486.328894] RIP: 0033:0x7fa4561c1500 <4>[ 486.328910] RSP: 002b:00007ffe7ac3d1b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 <4>[ 486.328942] RAX: ffffffffffffffda RBX: ffffffff81492083 RCX: 00007fa4561c1500 <4>[ 486.328968] RDX: 000000000000003f RSI: 000055869cdacc50 RDI: 0000000000000005 <4>[ 486.328995] RBP: ffffc9000112ff88 R08: 0000000000000000 R09: 0000000000000000 <4>[ 486.329021] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005 <4>[ 486.329047] R13: 000055869cdacc50 R14: 0000000000000000 R15: 0000000000000000 <4>[ 486.329078] ? __this_cpu_preempt_check+0x13/0x20 <4>[ 486.329102] Code: 20 06 00 00 48 a9 f8 ff ff ff 74 63 49 8b 84 24 a8 06 00 00 49 81 c4 a8 06 00 00 49 c7 c6 9b 96 ce 81 49 39 c4 48 8d 58 f8 74 2d <8b> 53 18 48 8d 7d c8 4c 89 f6 e8 4a 5c 00 00 48 8d 7d c8 48 89 <1>[ 486.329327] RIP: drm_framebuffer_info+0x77/0xc0 RSP: ffffc9000112fd68 <4>[ 486.329351] CR2: 0000000000000010 <4>[ 486.329370] ---[ end trace c108931ffc3d15ee ]--- All the vgem_basic oops are commit 0c51ef8684e30333ee2a7cef359258d91727166b Author: Noralf Trønnes <noralf@tronnes.org> Date: Tue Nov 14 00:12:01 2017 +0100 drm/debugfs: Fix framebuffer debugfs file init The introduction of: drm/framebuffer: Add framebuffer debugfs file broke vgem. That patch assumed that all drivers had initialized the dev->mode_config.fb_lock mutex which happens in drm_mode_config_init(). vgem doesn't need to call drm_mode_config_init(). Fix this by only creating the framebuffer debugfs file for modesetting drivers. Fixes: 45d58b40292b ("drm/framebuffer: Add framebuffer debugfs file") Suggested-by: Chris Wilson <chris@chris-wilson.co.uk> Signed-off-by: Noralf Trønnes <noralf@tronnes.org> Reviewed-by: Dave Airlie <airlied@redhat.com> Signed-off-by: Jani Nikula <jani.nikula@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20171113231201.26177-1-noralf@tronnes.org The simple identifier for this is the RIP: drm_framebuffer_info Rising priority since it is BAT. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3344/shard-kbl5/igt@gem_cpu_reloc@full.html The dmesg has all patterns from bug 103170, i.e starts with: <3>[ 96.704827] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang: then: <3>[ 105.404218] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:56:pipe C] flip_done timed out this goes on for the rest of dmesg. However the pstore doesn't make sense, in relation to dmesg timestamps. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3344/shard-kbl5/dmesg-1510675260_Panic_1.log commit 9865a6840df65109a62dcee2f18e6e65d62ae511 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Nov 13 12:57:06 2017 +0000 ftrace: Allow configuring global trace buffer size (for dump-on-oops) Was integrated to CI_DRM_3344. Apparently it doesn't work since https://bugs.freedesktop.org/show_bug.cgi?id=103706#c16 is on CU_DRM_3344. Here is another one: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3348/shard-glkb4/igt@gem_exec_nop@signal-all.html Note, from CI_DRM_3349 the ftrace buffersize will be set to 8192, I file this here because I still believe we need to trim the ftrace buffer so that we get "header" of the BUG_ON backtrace. I this case I believe it is probably bug 102035. However, since we also have bug 103533, I would need to objdump the actual kernel that was run in order to figure out which line it was. I don't know if decreasing the CONFIG_GLOBAL_TRACE_BUF_SIZE further will help. Maybe the ftrace will always write over the expected: "<4>[ 607.672089] ------------[ cut here ]------------ <2>[ 607.672099] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891! <4>[ 607.672155] invalid opcode: 0000 [#1] PREEMPT SMP" https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3351/shard-apl2/igt@kms_chv_cursor_fail@pipe-c-128x128-bottom-edge.html <0>[ 327.861785] kms_chv_-2351 1..s1 327841533us : intel_lrc_irq_handler: rcs0 csb[0d]: status=0x00008002:0x00000003 <0>[ 327.861847] kms_chv_-2351 1..s1 327841534us : intel_lrc_irq_handler: rcs0 out[0]: ctx=3.1, seqno=19d37 <0>[ 327.861865] --------------------------------- <4>[ 327.861875] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec lpc_ich snd_hwdep r8169 mii snd_hda_core mei_me snd_pcm mei prime_numbers pinctrl_broxton pinctrl_intel <4>[ 327.861960] CPU: 1 PID: 2351 Comm: kms_chv_cursor_ Tainted: G U 4.14.0-CI-CI_DRM_3351+ #1 <4>[ 327.861977] Hardware name: /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017 <4>[ 327.861995] task: ffff8802342ccec0 task.stack: ffffc900002b0000 <4>[ 327.862049] RIP: 0010:intel_lrc_irq_handler+0x40f/0xa00 [i915] <4>[ 327.862061] RSP: 0018:ffff88027fc83ef0 EFLAGS: 00010202 <4>[ 327.862073] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 00000000000002b4 <4>[ 327.862086] RDX: 0000000000000004 RSI: ffff88027608f5b8 RDI: ffff88027608f4e8 <4>[ 327.862100] RBP: ffff88027fc83f58 R08: ffff8802761c3000 R09: 0000000000000002 <4>[ 327.862113] R10: 0000000000000000 R11: ffff8802703679c0 R12: ffff8802703679c0 <4>[ 327.862126] R13: ffff88027032d040 R14: ffff8802705f2158 R15: ffff8802705f2160 <4>[ 327.862140] FS: 00007f25fc1b9a40(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000 <4>[ 327.862155] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[ 327.862167] CR2: 00005598c236cc18 CR3: 000000023633a000 CR4: 00000000003406e0 <4>[ 327.862181] Call Trace: <4>[ 327.862190] <IRQ> <4>[ 327.862204] tasklet_hi_action+0x8e/0x110 <4>[ 327.862217] __do_softirq+0xc6/0x4af <4>[ 327.862228] do_softirq_own_stack+0x2a/0x40 <4>[ 327.862238] </IRQ> <4>[ 327.862282] ? __i915_add_request+0x1ee/0x460 [i915] <4>[ 327.862295] do_softirq.part.4+0x56/0x60 <4>[ 327.862306] __local_bh_enable_ip+0x8c/0xa0 <4>[ 327.862350] i915_gem_do_execbuffer+0xde4/0x1550 [i915] <4>[ 327.862371] ? lock_acquire+0xaf/0x200 <4>[ 327.862382] ? __might_fault+0x3e/0x90 <4>[ 327.862431] ? i915_gem_execbuffer+0x2b0/0x2b0 [i915] <4>[ 327.862476] i915_gem_execbuffer2+0x1d7/0x3c0 [i915] <4>[ 327.862488] ? __might_fault+0x3e/0x90 <4>[ 327.862532] ? i915_gem_execbuffer+0x2b0/0x2b0 [i915] <4>[ 327.862544] drm_ioctl_kernel+0x65/0xb0 <4>[ 327.862555] drm_ioctl+0x295/0x340 <4>[ 327.862598] ? i915_gem_execbuffer+0x2b0/0x2b0 [i915] <4>[ 327.862611] ? lock_acquire+0xaf/0x200 <4>[ 327.862622] ? task_work_run+0x7b/0xc0 <4>[ 327.862634] do_vfs_ioctl+0x8f/0x670 <4>[ 327.862644] ? entry_SYSCALL_64_fastpath+0x5/0xb1 <4>[ 327.862655] ? trace_hardirqs_on_caller+0xde/0x1c0 <4>[ 327.862667] SyS_ioctl+0x3b/0x70 <4>[ 327.862678] entry_SYSCALL_64_fastpath+0x1c/0xb1 <4>[ 327.862689] RIP: 0033:0x7f25fa3b8587 <4>[ 327.862697] RSP: 002b:00007ffc179775e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 <4>[ 327.862713] RAX: ffffffffffffffda RBX: ffffc900002b3ff0 RCX: 00007f25fa3b8587 <4>[ 327.862726] RDX: 00007ffc17977680 RSI: 0000000040406469 RDI: 0000000000000004 <4>[ 327.862739] RBP: 000000000000003c R08: 0000000000000008 R09: 0000000000000000 <4>[ 327.862752] R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000046 <4>[ 327.862766] R13: 00007f25fa67cc20 R14: 0000000000000000 R15: 0000000000000000 <4>[ 327.862782] Code: e9 d8 fc ff ff 41 8b 46 24 44 8d a8 70 03 00 00 48 8b 45 c0 4c 03 a8 b8 0b 00 00 41 c7 86 9c 03 00 00 ff ff ff ff e9 80 fc ff ff <0f> 0b 0f 0b 0f 0b 49 8b 86 80 03 00 00 4d 89 f7 a8 02 74 20 41 <1>[ 327.862934] RIP: intel_lrc_irq_handler+0x40f/0xa00 [i915] RSP: ffff88027fc83ef0 <7>[ 327.863162] [drm:gen8_irq_handler [i915]] hotplug event received, stat 0x00000020, dig 0x10001918, pins 0x00000040 <7>[ 327.863268] [drm:intel_hpd_irq_handler [i915]] digital hpd port C - short <7>[ 327.863374] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port C - short <7>[ 327.863436] [drm:intel_power_well_enable [i915]] enabling always-on <7>[ 327.863486] [drm:intel_power_well_enable [i915]] enabling DC off <7>[ 327.863839] [drm:gen9_set_dc_state [i915]] Setting DC state from 01 to 00 <4>[ 327.863843] ---[ end trace e93b116545516f76 ]--- *** Bug 103660 has been marked as a duplicate of this bug. *** https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7210/fi-glk-dsi/igt@gem_exec_reloc@basic-cpu.html kconfig/debug: CONFIG_FTRACE=n This changes affects builds CI_DRM_3382 onwards. We'll see if this changes the unfortunate situation with pstore dumps. Marta, any updates on this? (In reply to Jani Saarinen from comment #28) > Marta, any updates on this? No, CI is completely broken on 4.15.0-rc1, so how could I The issue is still present: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3398/shard-kbl5/igt@gem_cs_tlb@render.html backtrace indicates softdog, but there is no room to get the "header" in due to too much ftrace stuff. bad case with pstores not matching dmesg, and/or destroyed by ftrace https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3401/shard-kbl1/igt@kms_flip@2x-plain-flip-ts-check.html This also is a pile of confusion. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3405/shard-glkb1/igt@kms_frontbuffer_tracking@psr-2p-rte.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3408/shard-apl3/igt@drv_selftest@live_gtt.html the dmesg timestamps don't match pstores so something appear to be wrong. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3445/shard-hsw3/igt@kms_flip@vblank-vs-dpms-suspend-interruptible.html none of the pstores match dmesg and are clearly from unrelated runs, since this run is first after reboot. (In reply to Marta Löfstedt from comment #35) > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3445/shard-hsw3/ > igt@kms_flip@vblank-vs-dpms-suspend-interruptible.html > > none of the pstores match dmesg and are clearly from unrelated runs, since > this run is first after reboot. Took out the HSW3 from the lab, there was no issue cleaning out pstore logs with pstore kernel. There was no issue cleaning out pstore logs that was force generated on drm kernel with pstore kernel. There was no issue cleaning out pstore gereated on drm kernel with drm kernel. So, maybe there is something wrong with the algorith that rename the pstore logs or there was something wrong with pstore clean out early on 4.15 which has now been fixed. This is strange. In the beginning of CI_IGT_test there is # sudo rm -rf /sys/fs/pstore/* || true and in the end of CI_IGT_collect (which runs if CI_IGT_test fails, timeouts, is aborted etc) # sudo mv -f /sys/fs/pstore/* ${LOGDIR}/${RUN}/|| true This has worked a long time like this, so my guess is still that something has changed in the kernel. In support of there was a pstore clearing out issue that is now solved: this pstore: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3445/shard-hsw3/pstore2-1512165009_Panic_1.log appear to be from this run: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3434/shard-hsw3/dmesg21.log no pstore logs was collected on CI_DRM_3434 This pstore: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3440/shard-hsw7/pstore0-1511883398_Panic_1.log appear to match: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3401/shard-hsw7/dmesg13.log This must be the record nr of pstore logs... https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3440/fi-glk-dsi/igt@gem_exec_suspend@basic-s3.html Here is an odd run: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shard-kbl2/igt@gem_render_copy_redux@normal.html run.log: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shard-kbl2/run15.log running: igt/kms_flip/vblank-vs-modeset-suspend [21/75] skip: 8, pass: 13 / FATAL: command execution failed java.io.EOFException indicates igt@kms_flip@vblank-vs-modeset-suspend was last test. However: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shards.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shard-kbl2/igt@kms_flip@vblank-vs-modeset-suspend.html and there was no incompletes on that run, however, there are 2 pstores associated with this run: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shard-kbl2/pstore15-1512101622_Oops_1.log https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shard-kbl2/pstore15-1512101622_Panic_2.log https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3451/shard-kbl3/igt@gem_tiled_swapping@non-threaded.html looks like old pstores are still coming out. result from Meta test: http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/meta.html These system have no pstore logs, so out of some reason they don't work as they should. http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-ivb-3520m/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-byt-j1900/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bxt-dsi/igt@meta_test@piglit-timeout.html Well, I guess this is not to be expected to work for gvtdvm, right? http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bdw-gvtdvm/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-skl-gvtdvm/igt@meta_test@piglit-timeout.html These have loads of irrelevant pstores, which we have now hopefully cleaned out. http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-ivb-3770/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-kbl-r/igt@meta_test@piglit-timeout.html These look OK, considering ftrace and KASAN was enabled. http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bdw-5557u/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bdw-gvtdvm/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bdw-gvtdvm/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-skl-6700hq/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-skl-6700k/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-skl-6770hq/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bxt-j4205/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-kbl-7500u/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-kbl-7560u/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-kbl-7567u/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-glk-1/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-glk-dsi/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-glk-dsi/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-glk-dsi/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-glk-dsi/igt@meta_test@piglit-timeout.html these say ftrace buffer is empty: http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-byt-n2820/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-hsw-4770/igt@meta_test@piglit-timeout.html http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-hsw-4770r/igt@meta_test@piglit-timeout.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3460/shard-apl1/igt@gem_mmap_gtt@basic-small-bo.html on a positive note, bug 104152 has a legit pstore. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3487/shard-kbl7/ has pstore logs, however there are no incompletes on the run, so why are there old pstores associated with it? The pstore is from previous runs: <4>[ 897.782937] CPU: 0 PID: 948 Comm: owatch Tainted: G U 4.15.0-rc2-CI-CI_DRM_3486+ #1 I'd be happy with any explanation why sudo rm -f /sys/fs/pstore/* in the beginning of the testing doesn't clear out the old files. I did a grep on the server for all pstores logs retrieved after CI_DRM_3499 until CI_DRM_3403 crazy amout of irrelevant pstore logs pulled out after incomplete: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3501/fi-glk-1/igt@gem_exec_suspend@basic-s3.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3501/shard-glkb4/igt@gem_exec_suspend@basic-s3.html pstores pulled out on runs that didn't even incomplete, however run.log doesn't match results so probably killed by Jenkins: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3501/shard-kbl3/igt@drv_suspend@fence-restore-untiled-hibernate.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3501/shard-kbl7/igt@drv_suspend@sysfs-reader-hibernate.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3501/shard-kbl6/igt@gem_exec_suspend@basic-s4.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3501/shard-kbl2/igt@gem_workarounds@hibernate-resume-fd.html pstores pulled out on 17 minutes timeout: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3501/shard-kbl5/igt@drv_suspend@fence-restore-tiled2untiled-hibernate.html on this run no test was actually run however a bunch of pstores where pulled out: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3501/fi-glk-dsi/ i.e. not a single legit pstore was collected. I forced panic:ed shards-hsw4 when it was booted on 4.15.0-rc3-CI-Patchwork_7478+ The machine rebooted to pstore kernel: 4.15.0-rc3-CI-Patchwork_7478+ Welcome to Ubuntu 17.04 (GNU/Linux 4.11.0-rc3-pstore+ x86_64) * Documentation: https://help.ubuntu.com * Management: https://landscape.canonical.com * Support: https://ubuntu.com/advantage Last login: Wed Dec 13 10:14:49 2017 from 10.237.72.167 testrunner@hsw4:~$ uname -r 4.11.0-rc3-pstore+ testrunner@hsw4:~$ ls /sys/fs/pstore dmesg-efi-151315300001001 dmesg-efi-151315300202001 testrunner@hsw4:~$ cat /sys/fs/pstore/dmesg-efi-151315300001001 Oops#1 Part1 <4>[23212.086286] FS: 00007f69bbcc1700(0000) GS:ffff88041fb40000(0000) knlGS:0000000000000000 <4>[23212.086291] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 <4>[23212.086294] CR2: 0000000000000000 CR3: 000000040b71b001 CR4: 00000000001606e0 <4>[23212.086298] Call Trace: <4>[23212.086301] __handle_sysrq+0x123/0x200 <4>[23212.086305] write_sysrq_trigger+0x43/0x50 <4>[23212.086308] proc_reg_write+0x38/0x60 <4>[23212.086313] __vfs_write+0x1e/0x130 <4>[23212.086317] ? rcu_read_lock_sched_held+0x6f/0x80 <4>[23212.086320] ? rcu_sync_lockdep_assert+0x25/0x50 <4>[23212.086323] ? __sb_start_write+0xd5/0x1f0 <4>[23212.086326] ? __sb_start_write+0xef/0x1f0 <4>[23212.086330] vfs_write+0xc0/0x1b0 <4>[23212.086334] SyS_write+0x40/0xa0 <4>[23212.086337] entry_SYSCALL_64_fastpath+0x1c/0x89 <4>[23212.086340] RIP: 0033:0x7f69bb7ed8f0 <4>[23212.086343] RSP: 002b:00007ffee4d1f668 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 <4>[23212.086348] RAX: ffffffffffffffda RBX: 000055c7c3c86bf0 RCX: 00007f69bb7ed8f0 <4>[23212.086352] RDX: 0000000000000002 RSI: 00007ffee4d1f750 RDI: 0000000000000003 <4>[23212.086356] RBP: 000055c7c3c86be8 R08: 000055c7c3c86d70 R09: 00007f69bbcc1700 <4>[23212.086360] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f69bbcdd1ef <4>[23212.086364] R13: 00007f69bbcdd1ec R14: 00007f69bbcdd000 R15: 0000000000000001 <4>[23212.086369] Code: 33 e8 e9 72 b9 ff 48 c7 c2 52 c8 55 81 be 01 00 00 00 48 c7 c7 c0 49 e4 81 e8 91 a2 b7 ff c7 05 57 e5 e7 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 e8 b6 9e b9 ff 84 c0 75 c4 48 c7 c2 <1>[23212.086402] RIP: sysrq_handle_crash+0x3c/0x70 RSP: ffffc90001acfe10 <4>[23212.086406] CR2: 0000000000000000 <4>[23212.086409] ---[ end trace a1025166a09c59fd ]--- testrunner@hsw4:~$ cat /sys/fs/pstore/dmesg-efi-151315300202001 Oops#1 Part2 <6>[23212.086128] sysrq: SysRq : Trigger a crash <1>[23212.086188] BUG: unable to handle kernel NULL pointer dereference at (null) <1>[23212.086196] IP: sysrq_handle_crash+0x3c/0x70 <6>[23212.086199] PGD 0 P4D 0 <4>[23212.086211] Oops: 0002 [#1] PREEMPT SMP <0>[23212.086215] Dumping ftrace buffer: <0>[23212.086220] (ftrace buffer empty) <4>[23212.086222] Modules linked in: snd_hda_intel i915 i2c_dev vgem snd_hda_codec_hdmi x86_pkg_temp_thermal snd_hda_codec_realtek intel_powerclamp snd_hda_codec_generic coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm lpc_ich mii prime_numbers mei_me mei [last unloaded: i915] <4>[23212.086249] CPU: 5 PID: 2939 Comm: tee Tainted: G U 4.15.0-rc3-CI-Patchwork_7478+ #1 <4>[23212.086254] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016 <4>[23212.086259] RIP: 0010:sysrq_handle_crash+0x3c/0x70 <4>[23212.086262] RSP: 0018:ffffc90001acfe10 EFLAGS: 00010282 <4>[23212.086266] RAX: ffff88040d328040 RBX: 0000000000000063 RCX: 1fcaac1f00000000 <4>[23212.086270] RDX: 0000000000000001 RSI: 00000000469deb48 RDI: 0000000000000000 <4>[23212.086274] RBP: ffffc90001acfe38 R08: ffff88040d328908 R09: 0000000067d877a6 <4>[23212.086278] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004 <4>[23212.086282] R13: ffffffff81e9abc0 R14: 0000000000000000 R15: 00007f69bbab28c0 so, let's see when above will be pulled out. testrunner@hsw4:~$ uname -r 4.15.0-rc3-CI-CI_DRM_3504+ testrunner@hsw4:~$ sudo ls /sys/fs/pstore/ dmesg-efi-151315300001001 dmesg-efi-151315300202001 my pstores are still there... testrunner@hsw4:~$ uname -r 4.11.0-rc3-pstore+ testrunner@hsw4:~$ ls /sys/fs/pstore dmesg-efi-151315300001001 dmesg-efi-151315300202001 testrunner@hsw4:~$ sudo rm /sys/fs/pstore/* testrunner@hsw4:~$ ls /sys/fs/pstore testrunner@hsw4:~$ sudo reboot then it reboots to generic kernel, which of course doesn't have PSTORE_EFI_VARS enabled: testrunner@hsw4:~$ uname -r 4.10.0-19-generic testrunner@hsw4:~$ sudo ls /sys/fs/pstore then it was picked up by IGT on 3504 testrunner@hsw4:~$ uname -r 4.15.0-rc3-CI-CI_DRM_3504+ testrunner@hsw4:~$ sudo ls /sys/fs/pstore testrunner@hsw4:~$ i.e. deleting with pstore kernel appear to work, so why didn't reboot to pstore between runs before really clear out pstore? Here is a new occurence where old pstores are pulled out. https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4060/shard-glkb2/igt@kms_flip@flip-vs-panning-vs-hang-interruptible.html note that last test in run.log doean't match the result. So, this is probably a Jenkins reboot on network issues. Note, I removed the impact in cibuglog on this bug. There is no point in causing reduced coverage on a lot of machines just because of old psotres broken by ftrace and now pstore not working at all. So, currently in cibuglog this bug is only tracked on one machine that actually isn't even in the lab. Now there appear to be some real stuff on it, lets see how this will pan out it is currently running shard/29 on CI_DRM_3506 4.15.0-rc3-CI-CI_DRM_3506+ testrunner@hsw4:~$ sudo ls /sys/fs/pstore dmesg-efi-151317130101001 dmesg-efi-151317130105001 dmesg-efi-151317130109001 dmesg-efi-151317130113001 dmesg-efi-151317130117001 dmesg-efi-151317130102001 dmesg-efi-151317130106001 dmesg-efi-151317130110001 dmesg-efi-151317130114001 dmesg-efi-151317130118001 dmesg-efi-151317130103001 dmesg-efi-151317130107001 dmesg-efi-151317130111001 dmesg-efi-151317130115001 dmesg-efi-151317130119001 dmesg-efi-151317130104001 dmesg-efi-151317130108001 dmesg-efi-151317130112001 dmesg-efi-151317130116001 now HSW4 is running shard 35 on CI_DRM_3506, but pstore stuff from before is still there: testrunner@hsw4:~$ uname -r 4.15.0-rc3-CI-CI_DRM_3506+ testrunner@hsw4:~$ sudo ls /sys/fs/pstore dmesg-efi-151317130101001 dmesg-efi-151317130105001 dmesg-efi-151317130109001 dmesg-efi-151317130113001 dmesg-efi-151317130117001 dmesg-efi-151317130102001 dmesg-efi-151317130106001 dmesg-efi-151317130110001 dmesg-efi-151317130114001 dmesg-efi-151317130118001 dmesg-efi-151317130103001 dmesg-efi-151317130107001 dmesg-efi-151317130111001 dmesg-efi-151317130115001 dmesg-efi-151317130119001 dmesg-efi-151317130104001 dmesg-efi-151317130108001 dmesg-efi-151317130112001 dmesg-efi-151317130116001 Now it's running CI_DRM_3507, the pstore stuff is still there: testrunner@hsw4:~$ uname -r 4.15.0-rc3-CI-CI_DRM_3507+ testrunner@hsw4:~$ sudo ls /sys/fs/pstore/ dmesg-efi-151317130101001 dmesg-efi-151317130105001 dmesg-efi-151317130109001 dmesg-efi-151317130113001 dmesg-efi-151317130117001 dmesg-efi-151317130102001 dmesg-efi-151317130106001 dmesg-efi-151317130110001 dmesg-efi-151317130114001 dmesg-efi-151317130118001 dmesg-efi-151317130103001 dmesg-efi-151317130107001 dmesg-efi-151317130111001 dmesg-efi-151317130115001 dmesg-efi-151317130119001 dmesg-efi-151317130104001 dmesg-efi-151317130108001 dmesg-efi-151317130112001 dmesg-efi-151317130116001 Now its running PW 7483: testrunner@hsw4:~$ uname -r 4.15.0-rc3-CI-Patchwork_7483+ testrunner@hsw4:~$ sudo ls /sys/fs/pstore/ dmesg-efi-151317130101001 dmesg-efi-151317130105001 dmesg-efi-151317130109001 dmesg-efi-151317130113001 dmesg-efi-151317130117001 dmesg-efi-151317130102001 dmesg-efi-151317130106001 dmesg-efi-151317130110001 dmesg-efi-151317130114001 dmesg-efi-151317130118001 dmesg-efi-151317130103001 dmesg-efi-151317130107001 dmesg-efi-151317130111001 dmesg-efi-151317130115001 dmesg-efi-151317130119001 dmesg-efi-151317130104001 dmesg-efi-151317130108001 dmesg-efi-151317130112001 dmesg-efi-151317130116001 Created attachment 136140 [details]
pstore on hsw4 when Ci_DRM_3506 was running
I copied out attached pstore when HSW4 was running CI_DRM_3506.
Note this pstore match dmesg of CI_DRM_3506/shard-hsw4/12/ perfectly
so, there appear to be no issue with writing the pstore.
However, the same files are still on the machine when it was running 4.15.0-rc3-CI-Patchwork_7484+
This has been fixed by updating permissions for pstore collection |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.