Bug 103706

Summary: [CI][BAT] igt* incomplete - /dev/kmsg buffer overrun, some messages lost - pstore cleanout broken
Product: DRI Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/IntelAssignee: Tomi Sarvela <tomi.p.sarvela>
Status: CLOSED WORKSFORME QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: high CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: ALL i915 features:
Attachments:
Description Flags
pstore on hsw4 when Ci_DRM_3506 was running none

Description Marta Löfstedt 2017-11-13 10:53:06 UTC
This bug will be used for various incompletes with confusing information in pstore.
Comment 11 Marta Löfstedt 2017-11-14 07:36:49 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7097/fi-bsw-n3050/igt@vgem_basic@dmabuf-export.html

from pstore note:
<44>[  528.342933] systemd-journald[179]: /dev/kmsg buffer overrun, some messages lost.

i.e. we can't trust the information.
Comment 12 Marta Löfstedt 2017-11-14 07:45:11 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_7090/fi-bxt-dsi/igt@vgem_basic@dmabuf-export.html

from dmesg, there is no pstore available.
ftrace "spamming" starts here:
<7>[  457.728243] [IGT] vgem_basic: executing
<0>[  471.342503] kms_psr_-4050    1..s1 444620858us : intel_lrc_irq_handler: bcs0 out[0]: ctx=3.1, seqno=28d1b

then we get into a chain of BUG_ON backtraces:
<3>[  472.794818] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34

<1>[  472.806539] BUG: unable to handle kernel paging request at fffffffefaacf48a

<3>[  472.811672] BUG: scheduling while atomic: kworker/1:15/3965/0x00000003

<4>[  472.812493] WARNING: CPU: 1 PID: 3965 at kernel/rcu/tree_plugin.h:329 rcu_note_context_switch+0x70/0x6c0

<3>[  537.840247] INFO: rcu_preempt detected stalls on CPUs/tasks:
<3>[  537.840299] 	Tasks blocked on level-0 rcu_node (CPUs 0-3): P3965
<3>[  537.840342] 	(detected by 0, t=65002 jiffies, g=28545, c=28544, q=29901)
Comment 13 Marta Löfstedt 2017-11-14 07:55:24 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_492/fi-bxt-j4205/igt@vgem_basic@dmabuf-export.html

dmesg:
<7>[  473.265673] [IGT] vgem_basic: starting subtest debugfs
<4>[  473.266084] INFO: trying to register non-static key.
<4>[  473.266097] the code is fine but needs lockdep annotation.
<4>[  473.266103] turning off the locking correctness validator.
then ftrace spamming:
 <0>[  473.266614] Dumping ftrace buffer:
<0>[  473.266661] ---------------------------------
<0>[  473.266848] CPU:1 [LOST 153991 EVENTS]
                     <...>-3366    1..s1 325544570us : intel_lrc_irq_handler: vecs0 in[0]:  ctx=2.1, seqno=1adce
...
<0>[  486.328080]   <idle>-0       2..s1 472520536us : intel_lrc_irq_handler: vecs0 out[0]: ctx=3.1, seqno=23667

the the backtraces:
<4>[  486.328666] Call Trace:
<4>[  486.328683]  ? drm_mode_gamma_get_ioctl+0xf0/0xf0

<3>[  486.617585] BUG: sleeping function called from invalid context at ./include/linux/percpu-rwsem.h:34

<5>[  486.623297] systemd-journald[195]: /dev/kmsg buffer overrun, some messages lost.
<1>[  486.627944] BUG: unable to handle kernel paging request at fffffffeecf32030

<3>[  486.633071] BUG: scheduling while atomic: kworker/2:2/205/0x00000003
<4>[  486.633091] INFO: lockdep is turned off.

<4>[  486.633310] WARNING: CPU: 2 PID: 205 at kernel/rcu/tree_plugin.h:329 rcu_note_context_switch+0x70/0x6c0

<7>[  487.001473] [IGT] vgem_basic: executing
<7>[  487.030825] [IGT] vgem_basic: starting subtest dmabuf-export
<3>[  551.597068] INFO: rcu_preempt detected stalls on CPUs/tasks:
<3>[  551.597121] 	Tasks blocked on level-0 rcu_node (CPUs 0-3): P205
<3>[  551.597164] 	(detected by 2, t=65002 jiffies, g=32449, c=32448, q=28149)

Note pstore only has the first stuff:
https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_492/fi-bxt-j4205/dmesg-1510488286_Oops_1.log
<0>[  486.328080]   <idle>-0       2..s1 472520536us : intel_lrc_irq_handler: vecs0 out[0]: ctx=3.1, seqno=23667
<0>[  486.328117] ---------------------------------
<4>[  486.328138] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep lpc_ich snd_hda_core snd_pcm r8169 mii mei_me prime_numbers mei i2c_hid pinctrl_broxton pinctrl_intel
<4>[  486.328309] CPU: 2 PID: 4341 Comm: vgem_basic Tainted: G     U          4.14.0-rc8-CI-CI_DRM_3335+ #1
<4>[  486.328343] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J4205-ITX, BIOS P1.10 09/29/2016
<4>[  486.328379] task: ffff88026b45a780 task.stack: ffffc9000112c000
<4>[  486.328408] RIP: 0010:drm_framebuffer_info+0x77/0xc0
<4>[  486.328428] RSP: 0018:ffffc9000112fd68 EFLAGS: 00010286
<4>[  486.328453] RAX: 0000000000000000 RBX: fffffffffffffff8 RCX: 0000000000000001
<4>[  486.328479] RDX: 0000000080000001 RSI: 0000000000000000 RDI: 00000000ffffffff
<4>[  486.328506] RBP: ffffc9000112fda0 R08: 0000000000000000 R09: 0000000000000000
<4>[  486.328532] R10: ffffc9000112fd58 R11: ffffffff81607ef8 R12: ffff8801fd432950
<4>[  486.328559] R13: ffff8801fd4328c8 R14: ffffffff81ce969b R15: ffff88026674e2b8
<4>[  486.328586] FS:  00007fa4579e08c0(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
<4>[  486.328617] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  486.328640] CR2: 0000000000000010 CR3: 00000002649e3000 CR4: 00000000003406e0
<4>[  486.328666] Call Trace:
<4>[  486.328683]  ? drm_mode_gamma_get_ioctl+0xf0/0xf0
<4>[  486.328709]  seq_read+0xe1/0x3a0
<4>[  486.328732]  full_proxy_read+0x54/0x80
<4>[  486.328753]  __vfs_read+0x28/0x120
<4>[  486.328772]  ? putname+0x54/0x60
<4>[  486.328791]  ? rcu_read_lock_sched_held+0x7a/0x90
<4>[  486.328814]  ? entry_SYSCALL_64_fastpath+0x5/0xb1
<4>[  486.328836]  vfs_read+0xa8/0x150
<4>[  486.328855]  SyS_read+0x49/0xb0
<4>[  486.328873]  entry_SYSCALL_64_fastpath+0x1c/0xb1
<4>[  486.328894] RIP: 0033:0x7fa4561c1500
<4>[  486.328910] RSP: 002b:00007ffe7ac3d1b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
<4>[  486.328942] RAX: ffffffffffffffda RBX: ffffffff81492083 RCX: 00007fa4561c1500
<4>[  486.328968] RDX: 000000000000003f RSI: 000055869cdacc50 RDI: 0000000000000005
<4>[  486.328995] RBP: ffffc9000112ff88 R08: 0000000000000000 R09: 0000000000000000
<4>[  486.329021] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
<4>[  486.329047] R13: 000055869cdacc50 R14: 0000000000000000 R15: 0000000000000000
<4>[  486.329078]  ? __this_cpu_preempt_check+0x13/0x20
<4>[  486.329102] Code: 20 06 00 00 48 a9 f8 ff ff ff 74 63 49 8b 84 24 a8 06 00 00 49 81 c4 a8 06 00 00 49 c7 c6 9b 96 ce 81 49 39 c4 48 8d 58 f8 74 2d <8b> 53 18 48 8d 7d c8 4c 89 f6 e8 4a 5c 00 00 48 8d 7d c8 48 89 
<1>[  486.329327] RIP: drm_framebuffer_info+0x77/0xc0 RSP: ffffc9000112fd68
<4>[  486.329351] CR2: 0000000000000010
<4>[  486.329370] ---[ end trace c108931ffc3d15ee ]---
Comment 14 Chris Wilson 2017-11-14 17:59:47 UTC
All the vgem_basic oops are
commit 0c51ef8684e30333ee2a7cef359258d91727166b
Author: Noralf Trønnes <noralf@tronnes.org>
Date:   Tue Nov 14 00:12:01 2017 +0100

    drm/debugfs: Fix framebuffer debugfs file init
    
    The introduction of: drm/framebuffer: Add framebuffer debugfs file
    broke vgem. That patch assumed that all drivers had initialized the
    dev->mode_config.fb_lock mutex which happens in drm_mode_config_init().
    vgem doesn't need to call drm_mode_config_init().
    
    Fix this by only creating the framebuffer debugfs file for modesetting
    drivers.
    
    Fixes: 45d58b40292b ("drm/framebuffer: Add framebuffer debugfs file")
    Suggested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Noralf Trønnes <noralf@tronnes.org>
    Reviewed-by: Dave Airlie <airlied@redhat.com>
    Signed-off-by: Jani Nikula <jani.nikula@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171113231201.26177-1-noralf@tronnes.org

The simple identifier for this is the  RIP: drm_framebuffer_info
Comment 15 Elizabeth 2017-11-14 19:55:05 UTC
Rising priority since it is BAT.
Comment 16 Marta Löfstedt 2017-11-15 07:34:01 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3344/shard-kbl5/igt@gem_cpu_reloc@full.html
The dmesg has all patterns from bug 103170, i.e starts with:
<3>[   96.704827] e1000e 0000:00:1f.6 eno1: Detected Hardware Unit Hang:
then:
<3>[  105.404218] [drm:drm_atomic_helper_wait_for_flip_done] *ERROR* [CRTC:56:pipe C] flip_done timed out
this goes on for the rest of dmesg. However the pstore doesn't make sense, in relation to dmesg timestamps.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3344/shard-kbl5/dmesg-1510675260_Panic_1.log
Comment 17 Marta Löfstedt 2017-11-15 10:15:51 UTC
commit 9865a6840df65109a62dcee2f18e6e65d62ae511
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 13 12:57:06 2017 +0000

    ftrace: Allow configuring global trace buffer size (for dump-on-oops)

Was integrated to CI_DRM_3344.

Apparently it doesn't work since https://bugs.freedesktop.org/show_bug.cgi?id=103706#c16 is on CU_DRM_3344.
Comment 18 Marta Löfstedt 2017-11-15 11:42:35 UTC
Here is another one:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3348/shard-glkb4/igt@gem_exec_nop@signal-all.html

Note, from CI_DRM_3349 the ftrace buffersize will be set to 8192,
Comment 19 Marta Löfstedt 2017-11-16 09:05:34 UTC
I file this here because I still believe we need to trim the ftrace buffer so that we get "header" of the BUG_ON backtrace.

I this case I believe it is probably bug 102035. However, since we also have bug 103533, I would need to objdump the actual kernel that was run in order to figure out which line it was.

I don't know if decreasing the
CONFIG_GLOBAL_TRACE_BUF_SIZE further will help. Maybe the ftrace will always write over the expected:

"<4>[  607.672089] ------------[ cut here ]------------
<2>[  607.672099] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:891!
<4>[  607.672155] invalid opcode: 0000 [#1] PREEMPT SMP"

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3351/shard-apl2/igt@kms_chv_cursor_fail@pipe-c-128x128-bottom-edge.html


<0>[  327.861785] kms_chv_-2351    1..s1 327841533us : intel_lrc_irq_handler: rcs0 csb[0d]: status=0x00008002:0x00000003
<0>[  327.861847] kms_chv_-2351    1..s1 327841534us : intel_lrc_irq_handler: rcs0 out[0]: ctx=3.1, seqno=19d37
<0>[  327.861865] ---------------------------------
<4>[  327.861875] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec lpc_ich snd_hwdep r8169 mii snd_hda_core mei_me snd_pcm mei prime_numbers pinctrl_broxton pinctrl_intel
<4>[  327.861960] CPU: 1 PID: 2351 Comm: kms_chv_cursor_ Tainted: G     U          4.14.0-CI-CI_DRM_3351+ #1
<4>[  327.861977] Hardware name:                  /NUC6CAYB, BIOS AYAPLCEL.86A.0040.2017.0619.1722 06/19/2017
<4>[  327.861995] task: ffff8802342ccec0 task.stack: ffffc900002b0000
<4>[  327.862049] RIP: 0010:intel_lrc_irq_handler+0x40f/0xa00 [i915]
<4>[  327.862061] RSP: 0018:ffff88027fc83ef0 EFLAGS: 00010202
<4>[  327.862073] RAX: 0000000000000005 RBX: 0000000000000000 RCX: 00000000000002b4
<4>[  327.862086] RDX: 0000000000000004 RSI: ffff88027608f5b8 RDI: ffff88027608f4e8
<4>[  327.862100] RBP: ffff88027fc83f58 R08: ffff8802761c3000 R09: 0000000000000002
<4>[  327.862113] R10: 0000000000000000 R11: ffff8802703679c0 R12: ffff8802703679c0
<4>[  327.862126] R13: ffff88027032d040 R14: ffff8802705f2158 R15: ffff8802705f2160
<4>[  327.862140] FS:  00007f25fc1b9a40(0000) GS:ffff88027fc80000(0000) knlGS:0000000000000000
<4>[  327.862155] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  327.862167] CR2: 00005598c236cc18 CR3: 000000023633a000 CR4: 00000000003406e0
<4>[  327.862181] Call Trace:
<4>[  327.862190]  <IRQ>
<4>[  327.862204]  tasklet_hi_action+0x8e/0x110
<4>[  327.862217]  __do_softirq+0xc6/0x4af
<4>[  327.862228]  do_softirq_own_stack+0x2a/0x40
<4>[  327.862238]  </IRQ>
<4>[  327.862282]  ? __i915_add_request+0x1ee/0x460 [i915]
<4>[  327.862295]  do_softirq.part.4+0x56/0x60
<4>[  327.862306]  __local_bh_enable_ip+0x8c/0xa0
<4>[  327.862350]  i915_gem_do_execbuffer+0xde4/0x1550 [i915]
<4>[  327.862371]  ? lock_acquire+0xaf/0x200
<4>[  327.862382]  ? __might_fault+0x3e/0x90
<4>[  327.862431]  ? i915_gem_execbuffer+0x2b0/0x2b0 [i915]
<4>[  327.862476]  i915_gem_execbuffer2+0x1d7/0x3c0 [i915]
<4>[  327.862488]  ? __might_fault+0x3e/0x90
<4>[  327.862532]  ? i915_gem_execbuffer+0x2b0/0x2b0 [i915]
<4>[  327.862544]  drm_ioctl_kernel+0x65/0xb0
<4>[  327.862555]  drm_ioctl+0x295/0x340
<4>[  327.862598]  ? i915_gem_execbuffer+0x2b0/0x2b0 [i915]
<4>[  327.862611]  ? lock_acquire+0xaf/0x200
<4>[  327.862622]  ? task_work_run+0x7b/0xc0
<4>[  327.862634]  do_vfs_ioctl+0x8f/0x670
<4>[  327.862644]  ? entry_SYSCALL_64_fastpath+0x5/0xb1
<4>[  327.862655]  ? trace_hardirqs_on_caller+0xde/0x1c0
<4>[  327.862667]  SyS_ioctl+0x3b/0x70
<4>[  327.862678]  entry_SYSCALL_64_fastpath+0x1c/0xb1
<4>[  327.862689] RIP: 0033:0x7f25fa3b8587
<4>[  327.862697] RSP: 002b:00007ffc179775e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
<4>[  327.862713] RAX: ffffffffffffffda RBX: ffffc900002b3ff0 RCX: 00007f25fa3b8587
<4>[  327.862726] RDX: 00007ffc17977680 RSI: 0000000040406469 RDI: 0000000000000004
<4>[  327.862739] RBP: 000000000000003c R08: 0000000000000008 R09: 0000000000000000
<4>[  327.862752] R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000046
<4>[  327.862766] R13: 00007f25fa67cc20 R14: 0000000000000000 R15: 0000000000000000
<4>[  327.862782] Code: e9 d8 fc ff ff 41 8b 46 24 44 8d a8 70 03 00 00 48 8b 45 c0 4c 03 a8 b8 0b 00 00 41 c7 86 9c 03 00 00 ff ff ff ff e9 80 fc ff ff <0f> 0b 0f 0b 0f 0b 49 8b 86 80 03 00 00 4d 89 f7 a8 02 74 20 41 
<1>[  327.862934] RIP: intel_lrc_irq_handler+0x40f/0xa00 [i915] RSP: ffff88027fc83ef0
<7>[  327.863162] [drm:gen8_irq_handler [i915]] hotplug event received, stat 0x00000020, dig 0x10001918, pins 0x00000040
<7>[  327.863268] [drm:intel_hpd_irq_handler [i915]] digital hpd port C - short
<7>[  327.863374] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port C - short
<7>[  327.863436] [drm:intel_power_well_enable [i915]] enabling always-on
<7>[  327.863486] [drm:intel_power_well_enable [i915]] enabling DC off
<7>[  327.863839] [drm:gen9_set_dc_state [i915]] Setting DC state from 01 to 00
<4>[  327.863843] ---[ end trace e93b116545516f76 ]---
Comment 20 Marta Löfstedt 2017-11-16 12:16:14 UTC
*** Bug 103660 has been marked as a duplicate of this bug. ***
Comment 27 Tomi Sarvela 2017-11-24 12:41:26 UTC
kconfig/debug: CONFIG_FTRACE=n

This changes affects builds CI_DRM_3382 onwards. We'll see if this changes the unfortunate situation with pstore dumps.
Comment 28 Jani Saarinen 2017-11-28 07:30:27 UTC
Marta, any updates on this?
Comment 29 Marta Löfstedt 2017-11-28 09:06:59 UTC
(In reply to Jani Saarinen from comment #28)
> Marta, any updates on this?

No, CI is completely broken on 4.15.0-rc1, so how could I
Comment 30 Marta Löfstedt 2017-11-28 12:54:50 UTC
The issue is still present:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3398/shard-kbl5/igt@gem_cs_tlb@render.html

backtrace indicates softdog, but there is no room to get the "header" in due to too much ftrace stuff.
Comment 31 Marta Löfstedt 2017-11-29 10:13:37 UTC
bad case with pstores not matching dmesg, and/or destroyed by ftrace

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3401/shard-kbl1/igt@kms_flip@2x-plain-flip-ts-check.html
Comment 34 Marta Löfstedt 2017-12-01 11:54:34 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3408/shard-apl3/igt@drv_selftest@live_gtt.html

the dmesg timestamps don't match pstores so something appear to be wrong.
Comment 35 Marta Löfstedt 2017-12-04 07:24:19 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3445/shard-hsw3/igt@kms_flip@vblank-vs-dpms-suspend-interruptible.html

none of the pstores match dmesg and are clearly from unrelated runs, since this run is first after reboot.
Comment 36 Marta Löfstedt 2017-12-04 11:40:30 UTC
(In reply to Marta Löfstedt from comment #35)
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3445/shard-hsw3/
> igt@kms_flip@vblank-vs-dpms-suspend-interruptible.html
> 
> none of the pstores match dmesg and are clearly from unrelated runs, since
> this run is first after reboot.

Took out the HSW3 from the lab, there was no issue cleaning out pstore logs with pstore kernel.

There was no issue cleaning out pstore logs that was force generated on drm kernel with pstore kernel.

There was no issue cleaning out pstore gereated on drm kernel with drm kernel.

So, maybe there is something wrong with the algorith that rename the pstore logs or there was something wrong with pstore clean out early on 4.15 which has now been fixed.
Comment 37 Tomi Sarvela 2017-12-04 12:10:29 UTC
This is strange. In the beginning of CI_IGT_test there is

# sudo rm -rf /sys/fs/pstore/* || true

and in the end of CI_IGT_collect (which runs if CI_IGT_test fails, timeouts, is aborted etc)

# sudo mv -f /sys/fs/pstore/* ${LOGDIR}/${RUN}/|| true

This has worked a long time like this, so my guess is still that something has changed in the kernel.
Comment 38 Marta Löfstedt 2017-12-04 12:50:52 UTC
In support of there was a pstore clearing out issue that is now solved:

this pstore: 

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3445/shard-hsw3/pstore2-1512165009_Panic_1.log 

appear to be from this run: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3434/shard-hsw3/dmesg21.log

no pstore logs was collected on CI_DRM_3434
Comment 40 Marta Löfstedt 2017-12-04 14:04:16 UTC
This must be the record nr of pstore logs...

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3440/fi-glk-dsi/igt@gem_exec_suspend@basic-s3.html
Comment 41 Marta Löfstedt 2017-12-04 14:19:46 UTC
Here is an odd run:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shard-kbl2/igt@gem_render_copy_redux@normal.html

run.log:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shard-kbl2/run15.log
running: igt/kms_flip/vblank-vs-modeset-suspend

[21/75] skip: 8, pass: 13 /                    
FATAL: command execution failed
java.io.EOFException

indicates igt@kms_flip@vblank-vs-modeset-suspend was last test. However:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shards.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shard-kbl2/igt@kms_flip@vblank-vs-modeset-suspend.html

and there was no incompletes on that run, however, there are 2 pstores associated with this run:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shard-kbl2/pstore15-1512101622_Oops_1.log
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3426/shard-kbl2/pstore15-1512101622_Panic_2.log
Comment 42 Marta Löfstedt 2017-12-05 08:26:46 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3451/shard-kbl3/igt@gem_tiled_swapping@non-threaded.html

looks like old pstores are still coming out.
Comment 43 Marta Löfstedt 2017-12-05 11:11:20 UTC
result from Meta test:
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/meta.html

These system have no pstore logs, so out of some reason they don't work as they should.
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-ivb-3520m/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-byt-j1900/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bxt-dsi/igt@meta_test@piglit-timeout.html

Well, I guess this is not to be expected to work for gvtdvm, right?
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bdw-gvtdvm/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-skl-gvtdvm/igt@meta_test@piglit-timeout.html

These have loads of irrelevant pstores, which we have now hopefully cleaned out.
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-ivb-3770/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-kbl-r/igt@meta_test@piglit-timeout.html

These look OK, considering ftrace and KASAN was enabled.
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bdw-5557u/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bdw-gvtdvm/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bdw-gvtdvm/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-skl-6700hq/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-skl-6700k/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-skl-6770hq/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-bxt-j4205/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-kbl-7500u/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-kbl-7560u/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-kbl-7567u/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-glk-1/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-glk-dsi/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-glk-dsi/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-glk-dsi/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-glk-dsi/igt@meta_test@piglit-timeout.html

these say ftrace buffer is empty:
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-byt-n2820/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-hsw-4770/igt@meta_test@piglit-timeout.html
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Custom_3478/fi-hsw-4770r/igt@meta_test@piglit-timeout.html
Comment 45 Marta Löfstedt 2017-12-07 08:16:39 UTC
on a positive note, bug 104152 has a legit pstore.
Comment 47 Marta Löfstedt 2017-12-12 08:39:48 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3487/shard-kbl7/

has pstore logs, however there are no incompletes on the run, so why are there old pstores associated with it?
Comment 48 Tomi Sarvela 2017-12-12 08:49:01 UTC
The pstore is from previous runs:
<4>[  897.782937] CPU: 0 PID: 948 Comm: owatch Tainted: G     U           4.15.0-rc2-CI-CI_DRM_3486+ #1

I'd be happy with any explanation why sudo rm -f /sys/fs/pstore/* in the beginning of the testing doesn't clear out the old files.
Comment 50 Marta Löfstedt 2017-12-13 08:24:36 UTC
I forced panic:ed shards-hsw4 when it was booted on 4.15.0-rc3-CI-Patchwork_7478+
The machine rebooted to pstore kernel:
4.15.0-rc3-CI-Patchwork_7478+
Welcome to Ubuntu 17.04 (GNU/Linux 4.11.0-rc3-pstore+ x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage
Last login: Wed Dec 13 10:14:49 2017 from 10.237.72.167
testrunner@hsw4:~$ uname -r
4.11.0-rc3-pstore+
testrunner@hsw4:~$ ls /sys/fs/pstore
dmesg-efi-151315300001001  dmesg-efi-151315300202001
testrunner@hsw4:~$ cat /sys/fs/pstore/dmesg-efi-151315300001001 
Oops#1 Part1
<4>[23212.086286] FS:  00007f69bbcc1700(0000) GS:ffff88041fb40000(0000) knlGS:0000000000000000
<4>[23212.086291] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[23212.086294] CR2: 0000000000000000 CR3: 000000040b71b001 CR4: 00000000001606e0
<4>[23212.086298] Call Trace:
<4>[23212.086301]  __handle_sysrq+0x123/0x200
<4>[23212.086305]  write_sysrq_trigger+0x43/0x50
<4>[23212.086308]  proc_reg_write+0x38/0x60
<4>[23212.086313]  __vfs_write+0x1e/0x130
<4>[23212.086317]  ? rcu_read_lock_sched_held+0x6f/0x80
<4>[23212.086320]  ? rcu_sync_lockdep_assert+0x25/0x50
<4>[23212.086323]  ? __sb_start_write+0xd5/0x1f0
<4>[23212.086326]  ? __sb_start_write+0xef/0x1f0
<4>[23212.086330]  vfs_write+0xc0/0x1b0
<4>[23212.086334]  SyS_write+0x40/0xa0
<4>[23212.086337]  entry_SYSCALL_64_fastpath+0x1c/0x89
<4>[23212.086340] RIP: 0033:0x7f69bb7ed8f0
<4>[23212.086343] RSP: 002b:00007ffee4d1f668 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
<4>[23212.086348] RAX: ffffffffffffffda RBX: 000055c7c3c86bf0 RCX: 00007f69bb7ed8f0
<4>[23212.086352] RDX: 0000000000000002 RSI: 00007ffee4d1f750 RDI: 0000000000000003
<4>[23212.086356] RBP: 000055c7c3c86be8 R08: 000055c7c3c86d70 R09: 00007f69bbcc1700
<4>[23212.086360] R10: 0000000000000001 R11: 0000000000000246 R12: 00007f69bbcdd1ef
<4>[23212.086364] R13: 00007f69bbcdd1ec R14: 00007f69bbcdd000 R15: 0000000000000001
<4>[23212.086369] Code: 33 e8 e9 72 b9 ff 48 c7 c2 52 c8 55 81 be 01 00 00 00 48 c7 c7 c0 49 e4 81 e8 91 a2 b7 ff c7 05 57 e5 e7 00 01 00 00 00 0f ae f8 <c6> 04 25 00 00 00 00 01 c3 e8 b6 9e b9 ff 84 c0 75 c4 48 c7 c2 
<1>[23212.086402] RIP: sysrq_handle_crash+0x3c/0x70 RSP: ffffc90001acfe10
<4>[23212.086406] CR2: 0000000000000000
<4>[23212.086409] ---[ end trace a1025166a09c59fd ]---
testrunner@hsw4:~$ cat /sys/fs/pstore/dmesg-efi-151315300202001 
Oops#1 Part2
<6>[23212.086128] sysrq: SysRq : Trigger a crash
<1>[23212.086188] BUG: unable to handle kernel NULL pointer dereference at           (null)
<1>[23212.086196] IP: sysrq_handle_crash+0x3c/0x70
<6>[23212.086199] PGD 0 P4D 0 
<4>[23212.086211] Oops: 0002 [#1] PREEMPT SMP
<0>[23212.086215] Dumping ftrace buffer:
<0>[23212.086220]    (ftrace buffer empty)
<4>[23212.086222] Modules linked in: snd_hda_intel i915 i2c_dev vgem snd_hda_codec_hdmi x86_pkg_temp_thermal snd_hda_codec_realtek intel_powerclamp snd_hda_codec_generic coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core r8169 snd_pcm lpc_ich mii prime_numbers mei_me mei [last unloaded: i915]
<4>[23212.086249] CPU: 5 PID: 2939 Comm: tee Tainted: G     U           4.15.0-rc3-CI-Patchwork_7478+ #1
<4>[23212.086254] Hardware name: MSI MS-7924/Z97M-G43(MS-7924), BIOS V1.12 02/15/2016
<4>[23212.086259] RIP: 0010:sysrq_handle_crash+0x3c/0x70
<4>[23212.086262] RSP: 0018:ffffc90001acfe10 EFLAGS: 00010282
<4>[23212.086266] RAX: ffff88040d328040 RBX: 0000000000000063 RCX: 1fcaac1f00000000
<4>[23212.086270] RDX: 0000000000000001 RSI: 00000000469deb48 RDI: 0000000000000000
<4>[23212.086274] RBP: ffffc90001acfe38 R08: ffff88040d328908 R09: 0000000067d877a6
<4>[23212.086278] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
<4>[23212.086282] R13: ffffffff81e9abc0 R14: 0000000000000000 R15: 00007f69bbab28c0

so, let's see when above will be pulled out.
Comment 51 Marta Löfstedt 2017-12-13 11:45:33 UTC
testrunner@hsw4:~$ uname -r
4.15.0-rc3-CI-CI_DRM_3504+

testrunner@hsw4:~$ sudo ls /sys/fs/pstore/
dmesg-efi-151315300001001  dmesg-efi-151315300202001

my pstores are still there...
Comment 52 Marta Löfstedt 2017-12-13 12:12:19 UTC
testrunner@hsw4:~$ uname -r
4.11.0-rc3-pstore+
testrunner@hsw4:~$ ls /sys/fs/pstore
dmesg-efi-151315300001001  dmesg-efi-151315300202001
testrunner@hsw4:~$ sudo rm /sys/fs/pstore/*
testrunner@hsw4:~$ ls /sys/fs/pstore
testrunner@hsw4:~$ sudo reboot
then it reboots to generic kernel, which of course doesn't have PSTORE_EFI_VARS enabled:
testrunner@hsw4:~$ uname -r
4.10.0-19-generic
testrunner@hsw4:~$ sudo ls /sys/fs/pstore

then it was picked up by IGT on 3504
testrunner@hsw4:~$ uname -r
4.15.0-rc3-CI-CI_DRM_3504+
testrunner@hsw4:~$ sudo ls /sys/fs/pstore
testrunner@hsw4:~$ 

i.e. deleting with pstore kernel appear to work, so why didn't reboot to pstore between runs before really clear out pstore?
Comment 53 Marta Löfstedt 2017-12-13 13:11:37 UTC
Here is a new occurence where old pstores are pulled out.

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4060/shard-glkb2/igt@kms_flip@flip-vs-panning-vs-hang-interruptible.html

note that last test in run.log doean't match the result. So, this is probably a Jenkins reboot on network issues.
Comment 54 Marta Löfstedt 2017-12-13 13:42:36 UTC
Note, I removed the impact in cibuglog on this bug. There is no point in causing reduced coverage on a lot of machines just because of old psotres broken by ftrace and now pstore not working at all.
So, currently in cibuglog this bug is only tracked on one machine that actually isn't even in the lab.
Comment 55 Marta Löfstedt 2017-12-13 14:12:41 UTC
Now there appear to be some real stuff on it, lets see how this will pan out it is currently running shard/29 on CI_DRM_3506

4.15.0-rc3-CI-CI_DRM_3506+
testrunner@hsw4:~$ sudo ls /sys/fs/pstore
dmesg-efi-151317130101001  dmesg-efi-151317130105001  dmesg-efi-151317130109001  dmesg-efi-151317130113001  dmesg-efi-151317130117001
dmesg-efi-151317130102001  dmesg-efi-151317130106001  dmesg-efi-151317130110001  dmesg-efi-151317130114001  dmesg-efi-151317130118001
dmesg-efi-151317130103001  dmesg-efi-151317130107001  dmesg-efi-151317130111001  dmesg-efi-151317130115001  dmesg-efi-151317130119001
dmesg-efi-151317130104001  dmesg-efi-151317130108001  dmesg-efi-151317130112001  dmesg-efi-151317130116001
Comment 56 Marta Löfstedt 2017-12-13 14:21:53 UTC
now HSW4 is running shard 35 on CI_DRM_3506,

but pstore stuff from before is still there:

testrunner@hsw4:~$ uname -r
4.15.0-rc3-CI-CI_DRM_3506+
testrunner@hsw4:~$ sudo ls /sys/fs/pstore
dmesg-efi-151317130101001  dmesg-efi-151317130105001  dmesg-efi-151317130109001  dmesg-efi-151317130113001  dmesg-efi-151317130117001
dmesg-efi-151317130102001  dmesg-efi-151317130106001  dmesg-efi-151317130110001  dmesg-efi-151317130114001  dmesg-efi-151317130118001
dmesg-efi-151317130103001  dmesg-efi-151317130107001  dmesg-efi-151317130111001  dmesg-efi-151317130115001  dmesg-efi-151317130119001
dmesg-efi-151317130104001  dmesg-efi-151317130108001  dmesg-efi-151317130112001  dmesg-efi-151317130116001
Comment 57 Marta Löfstedt 2017-12-13 14:28:24 UTC
Now it's running CI_DRM_3507, the pstore stuff is still there:

testrunner@hsw4:~$ uname -r
4.15.0-rc3-CI-CI_DRM_3507+
testrunner@hsw4:~$ sudo ls /sys/fs/pstore/
dmesg-efi-151317130101001  dmesg-efi-151317130105001  dmesg-efi-151317130109001  dmesg-efi-151317130113001  dmesg-efi-151317130117001
dmesg-efi-151317130102001  dmesg-efi-151317130106001  dmesg-efi-151317130110001  dmesg-efi-151317130114001  dmesg-efi-151317130118001
dmesg-efi-151317130103001  dmesg-efi-151317130107001  dmesg-efi-151317130111001  dmesg-efi-151317130115001  dmesg-efi-151317130119001
dmesg-efi-151317130104001  dmesg-efi-151317130108001  dmesg-efi-151317130112001  dmesg-efi-151317130116001
Comment 58 Marta Löfstedt 2017-12-13 15:00:48 UTC
Now its running PW 7483:

testrunner@hsw4:~$ uname -r
4.15.0-rc3-CI-Patchwork_7483+
testrunner@hsw4:~$ sudo ls /sys/fs/pstore/
dmesg-efi-151317130101001  dmesg-efi-151317130105001  dmesg-efi-151317130109001  dmesg-efi-151317130113001  dmesg-efi-151317130117001
dmesg-efi-151317130102001  dmesg-efi-151317130106001  dmesg-efi-151317130110001  dmesg-efi-151317130114001  dmesg-efi-151317130118001
dmesg-efi-151317130103001  dmesg-efi-151317130107001  dmesg-efi-151317130111001  dmesg-efi-151317130115001  dmesg-efi-151317130119001
dmesg-efi-151317130104001  dmesg-efi-151317130108001  dmesg-efi-151317130112001  dmesg-efi-151317130116001
Comment 59 Marta Löfstedt 2017-12-13 15:23:05 UTC
Created attachment 136140 [details]
pstore on hsw4 when Ci_DRM_3506 was running

I copied out attached pstore when HSW4 was running CI_DRM_3506. 

Note this pstore match dmesg of CI_DRM_3506/shard-hsw4/12/ perfectly

so, there appear to be no issue with writing the pstore.

However, the same files are still on the machine when it was running 4.15.0-rc3-CI-Patchwork_7484+
Comment 60 Marta Löfstedt 2017-12-15 09:08:42 UTC
This has been fixed by updating permissions for pstore collection

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.