70 machines are producing the following errors: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5156/shard-iclb6/igt@i915_suspend@shrink.html <4> [2684.655248] i915_suspend invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=1000 <4> [2684.655299] CPU: 3 PID: 9251 Comm: i915_suspend Tainted: G U 4.20.0-rc3-CI-CI_DRM_5156+ #1 <4> [2684.655301] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018 <4> [2684.655303] Call Trace: <4> [2684.655308] dump_stack+0x67/0x9b <4> [2684.655314] dump_header+0x60/0x42e <4> [2684.655318] ? lockdep_hardirqs_on+0xe0/0x1b0 <4> [2684.655322] ? _raw_spin_unlock_irqrestore+0x39/0x60 <4> [2684.655327] oom_kill_process+0x310/0x3a0 <4> [2684.655332] out_of_memory+0x101/0x380 <4> [2684.655336] __alloc_pages_nodemask+0xda6/0x1170 <4> [2684.655342] ? find_get_entry+0x198/0x2f0 <4> [2684.655351] shmem_alloc_and_acct_page+0x6f/0x1d0 <4> [2684.655356] shmem_getpage_gfp.isra.8+0x16d/0xd90 <4> [2684.655359] ? reacquire_held_locks+0xb5/0x1b0 <4> [2684.655367] shmem_fault+0x93/0x1f0 <4> [2684.655371] ? __handle_mm_fault+0xce9/0xfc0 <4> [2684.655376] __do_fault+0x1b/0x80 <4> [2684.655379] ? _raw_spin_unlock+0x29/0x40 <4> [2684.655381] __handle_mm_fault+0x9ac/0xfc0 <4> [2684.655389] handle_mm_fault+0x197/0x3f0 <4> [2684.655394] __get_user_pages+0x204/0x7f0 <4> [2684.655404] populate_vma_page_range+0x68/0x70 <4> [2684.655407] __mm_populate+0xa4/0x150 <4> [2684.655412] do_mlock+0x10c/0x230 <4> [2684.655416] __x64_sys_mlock+0x12/0x20 <4> [2684.655419] do_syscall_64+0x55/0x190 <4> [2684.655422] entry_SYSCALL_64_after_hwframe+0x49/0xbe <4> [2684.655425] RIP: 0033:0x7fca3785cc47 <4> [2684.655430] Code: Bad RIP value. <4> [2684.655432] RSP: 002b:00007ffe28e52308 EFLAGS: 00000287 ORIG_RAX: 0000000000000095 <4> [2684.655435] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fca3785cc47 <4> [2684.655437] RDX: 00000003ce440000 RSI: 00000003ce440000 RDI: 00007fc6500aa000 <4> [2684.655438] RBP: 00007ffe28e52350 R08: 00007fca381b3980 R09: 00007fca37d4c2b0 <4> [2684.655440] R10: 0000000000000000 R11: 0000000000000287 R12: 000055e2cfaf10c0 <4> [2684.655441] R13: 00007ffe28e524a0 R14: 0000000000000000 R15: 0000000000000000 <4> [2684.655449] Mem-Info: <4> [2684.655454] active_anon:0 inactive_anon:41 isolated_anon:0\x0a active_file:11 inactive_file:37 isolated_file:0\x0a unevictable:3946670 dirty:2 writeback:0 unstable:0\x0a slab_reclaimable:20558 slab_unreclaimable:19758\x0a mapped:3946565 shmem:3946670 pagetables:15204 bounce:0\x0a free:35801 free_pcp:30 free_cma:0 <4> [2684.655458] Node 0 active_anon:0kB inactive_anon:164kB active_file:44kB inactive_file:148kB unevictable:15786680kB isolated(anon):0kB isolated(file):0kB mapped:15786260kB dirty:8kB writeback:0kB shmem:15786680kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes <4> [2684.655463] DMA free:15884kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB <4> [2684.655464] lowmem_reserve[]: 0 704 15784 15784 <4> [2684.655471] DMA32 free:63212kB min:3012kB low:3764kB high:4516kB active_anon:4kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:658232kB writepending:0kB present:838716kB managed:725464kB mlocked:658232kB kernel_stack:0kB pagetables:1288kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB <4> [2684.655472] lowmem_reserve[]: 0 0 15079 15079 <4> [2684.655478] Normal free:64108kB min:64500kB low:80624kB high:96748kB active_anon:0kB inactive_anon:0kB active_file:324kB inactive_file:16kB unevictable:15128292kB writepending:0kB present:15732736kB managed:15441548kB mlocked:15128076kB kernel_stack:4656kB pagetables:59528kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB <4> [2684.655479] lowmem_reserve[]: 0 0 0 0 <4> [2684.655482] DMA: 1*4kB (U) 1*8kB (U) 2*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB <4> [2684.655515] DMA32: 1*4kB (U) 2*8kB (UM) 3*16kB (UM) 2*32kB (U) 0*64kB 1*128kB (E) 0*256kB 3*512kB (UME) 2*1024kB (ME) 1*2048kB (U) 14*4096kB (ME) = 63236kB <4> [2684.655526] Normal: 117*4kB (UME) 41*8kB (UME) 174*16kB (UME) 78*32kB (UME) 38*64kB (UME) 153*128kB (UM) 98*256kB (UME) 22*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 64444kB <6> [2684.655538] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB <4> [2684.655540] 3946783 total pagecache pages <4> [2684.655544] 7 pages in swap cache <4> [2684.655546] Swap cache stats: add 54938, delete 54930, find 172/271 <4> [2684.655547] Free swap = 1877244kB <4> [2684.655549] Total swap = 2097148kB <4> [2684.655551] 4146861 pages RAM <4> [2684.655552] 0 pages HighMem/MovableOnly <4> [2684.655554] 101137 pages reserved <6> [2684.655555] Tasks state (memory values in pages): <6> [2684.655557] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name <6> [2684.655672] [ 296] 0 296 25766 0 212992 201 0 systemd-journal <6> [2684.655795] [ 317] 0 317 11081 1 114688 497 -1000 systemd-udevd <6> [2684.655800] [ 492] 101 492 17652 0 176128 166 0 systemd-resolve <6> [2684.655803] [ 495] 102 495 65758 0 167936 1012 0 rsyslogd <6> [2684.655807] [ 497] 103 497 12580 1 143360 241 -900 dbus-daemon <6> [2684.655811] [ 518] 0 518 17647 1 176128 185 0 systemd-logind <6> [2684.655815] [ 525] 0 525 44726 1 237568 1992 0 networkd-dispat <6> [2684.655818] [ 533] 0 533 1137 0 57344 48 0 acpid <6> [2684.655822] [ 550] 0 550 27618 0 114688 98 0 irqbalance <6> [2684.655826] [ 561] 0 561 2478 0 61440 33 0 rngd <6> [2684.655830] [ 601] 0 601 141206 0 446464 739 0 NetworkManager <6> [2684.655833] [ 677] 0 677 18073 1 180224 187 -1000 sshd <6> [2684.655837] [ 678] 0 678 72218 0 200704 219 0 polkitd <6> [2684.655841] [ 682] 0 682 6413 1 94208 305 0 dhclient <6> [2684.655845] [ 829] 0 829 26430 1 245760 246 0 sshd <6> [2684.655849] [ 834] 1000 834 19153 1 192512 271 0 systemd <6> [2684.655853] [ 837] 1000 837 28477 0 253952 597 0 (sd-pam) <6> [2684.655857] [ 875] 1000 875 27133 1 249856 373 0 sshd <6> [2684.655861] [ 911] 1000 911 2215825 0 966656 44277 0 java <6> [2684.655865] [ 949] 112 949 96616 1 372736 425 0 whoopsie <6> [2684.655868] [ 958] 0 958 6126 0 69632 37 0 agetty <6> [2684.655872] [ 1037] 0 1037 18483 1 184320 119 0 sudo <6> [2684.655876] [ 1042] 0 1042 2478 0 65536 33 0 rngd <6> [2684.655879] [ 1681] 0 1681 18483 1 188416 118 0 sudo <6> [2684.655883] [ 1686] 0 1686 2478 0 57344 32 0 rngd <6> [2684.655886] [ 2307] 0 2307 18483 1 184320 119 0 sudo <6> [2684.655890] [ 2312] 0 2312 2478 0 65536 32 0 rngd <6> [2684.655893] [ 2987] 0 2987 18483 1 184320 118 0 sudo <6> [2684.655897] [ 2992] 0 2992 2478 0 61440 33 0 rngd <6> [2684.655901] [ 3743] 0 3743 18483 1 184320 119 0 sudo <6> [2684.655904] [ 3748] 0 3748 2478 0 61440 33 0 rngd <6> [2684.655909] [ 5537] 0 5537 18483 1 184320 118 0 sudo <6> [2684.655920] [ 5542] 0 5542 2478 0 65536 33 0 rngd <6> [2684.655930] [ 9097] 1000 9097 5302 1 77824 73 0 bash <6> [2684.655934] [ 9173] 0 9173 18483 1 180224 120 0 sudo <6> [2684.655937] [ 9178] 0 9178 2478 0 61440 32 0 rngd <6> [2684.655941] [ 9197] 1000 9197 6097 0 81920 38 0 dmesg <6> [2684.655945] [ 9199] 0 9199 18483 1 180224 119 0 sudo <6> [2684.655948] [ 9203] 0 9203 50678 1 397312 435 0 igt_runner <6> [2684.655952] [ 9250] 0 9250 4095973 2941969 23998464 430 1000 i915_suspend <6> [2684.655956] [ 9251] 0 9251 4095973 3946578 32022528 430 1000 i915_suspend <3> [2684.655958] Out of memory: Kill process 9251 (i915_suspend) score 1865 or sacrifice child <3> [2684.657457] Killed process 9251 (i915_suspend) total-vm:16383892kB, anon-rss:0kB, file-rss:0kB, shmem-rss:15786312kB
The oomkiller is being invoked on purpose as part of the test and there is no way to suppress the error messages. Of course, sometimes, oomkiller fails and the test fails in which case the error is significant...
(In reply to Chris Wilson from comment #1) > The oomkiller is being invoked on purpose as part of the test and there is > no way to suppress the error messages. Of course, sometimes, oomkiller fails > and the test fails in which case the error is significant... This is new though. 70 machines hitting this at the same time, and having enormous logs filing up the DB...
(In reply to Martin Peres from comment #2) > (In reply to Chris Wilson from comment #1) > > The oomkiller is being invoked on purpose as part of the test and there is > > no way to suppress the error messages. Of course, sometimes, oomkiller fails > > and the test fails in which case the error is significant... > > This is new though. 70 machines hitting this at the same time, and having > enormous logs filing up the DB... It was only new due to the test name change (drv_suspend -> i915_suspend). There appears to be no way from userspace to suppress the warn_alloc_show_mem()
(In reply to Chris Wilson from comment #3) > (In reply to Martin Peres from comment #2) > > (In reply to Chris Wilson from comment #1) > > > The oomkiller is being invoked on purpose as part of the test and there is > > > no way to suppress the error messages. Of course, sometimes, oomkiller fails > > > and the test fails in which case the error is significant... > > > > This is new though. 70 machines hitting this at the same time, and having > > enormous logs filing up the DB... > > It was only new due to the test name change (drv_suspend -> i915_suspend). > There appears to be no way from userspace to suppress the > warn_alloc_show_mem() Thanks for the explanation. Could we add a kernel parameter to hide this? It could land in core-for-CI, and we could even try to upstream it. Generating this warning in every run on every machine just fills the database, and adds noise to results which confuse people (dmesg-warn is not an acceptable status).
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_195/fi-kbl-guc/igt@gem_tiled_swapping@non-threaded.html <6> [172.407794] [IGT] gem_tiled_swapping: executing <6> [183.558045] Purging GPU memory, 0 pages freed, 82 pages still pinned. <4> [183.558188] thermald invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0 <4> [183.558209] CPU: 4 PID: 530 Comm: thermald Tainted: G U 5.0.0-rc2-gcd04bc47971a-drmtip_195+ #1 <4> [183.558210] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018 <4> [183.558211] Call Trace: <4> [183.558215] dump_stack+0x67/0x9b <4> [183.558218] dump_header+0x52/0x58e <4> [183.558221] ? lockdep_hardirqs_on+0xe0/0x1b0 <4> [183.558224] ? _raw_spin_unlock_irqrestore+0x39/0x60 <4> [183.558226] oom_kill_process+0x310/0x3a0 <4> [183.558229] out_of_memory+0x101/0x3b0 <4> [183.558232] __alloc_pages_nodemask+0xd6c/0x1110 <4> [183.558235] ? lock_acquire+0xa6/0x1c0 <4> [183.558241] __read_swap_cache_async+0x131/0x1d0 <4> [183.558244] read_swap_cache_async+0x23/0x60 <4> [183.558247] swapin_readahead+0x14a/0x3f0 <4> [183.558251] ? pagecache_get_page+0x2b/0x210 <4> [183.558254] ? do_swap_page+0x2ea/0x950 <4> [183.558256] do_swap_page+0x2ea/0x950 <4> [183.558258] ? __switch_to_asm+0x40/0x70 <4> [183.558261] __handle_mm_fault+0x66a/0xfa0 <4> [183.558266] handle_mm_fault+0x196/0x3a0 <4> [183.558270] __do_page_fault+0x246/0x500 <4> [183.558273] page_fault+0x1e/0x30 <4> [183.558276] RIP: 0010:do_sys_poll+0x395/0x580 <4> [183.558277] Code: 41 0c 85 c0 0f 8e 2b 01 00 00 31 c0 eb 10 83 c0 01 48 83 c2 08 39 41 08 0f 8e 17 01 00 00 0f 01 cb 48 63 f0 41 0f b7 74 f0 06 <66> 89 72 06 31 f6 0f 01 ca 85 f6 74 d7 c7 85 40 fc ff ff f2 ff ff <4> [183.558279] RSP: 0018:ffffaacdc0c23ae0 EFLAGS: 00050246 <4> [183.558280] RAX: 0000000000000000 RBX: ffffaacdc0c23b5c RCX: ffffaacdc0c23b40 <4> [183.558281] RDX: 000055a7d8947018 RSI: 0000000000000000 RDI: 00000000fffffff2 <4> [183.558282] RBP: ffffaacdc0c23ef0 R08: ffffaacdc0c23b4c R09: 0000000000000000 <4> [183.558283] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 <4> [183.558284] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 <4> [183.558295] ? lock_acquire+0xa6/0x1c0 <4> [183.558298] ? __is_insn_slot_addr+0x8d/0x120 <4> [183.558301] ? __lock_acquire+0x3c7/0x1b00 <4> [183.558305] ? poll_select_copy_remaining+0x1b0/0x1b0 <4> [183.558308] ? poll_select_copy_remaining+0x1b0/0x1b0 <4> [183.558310] ? lock_acquire+0xa6/0x1c0 <4> [183.558313] ? __lock_acquire+0x3c7/0x1b00 <4> [183.558315] ? __lock_acquire+0x3c7/0x1b00 <4> [183.558319] ? debug_object_active_state+0x137/0x160 <4> [183.558322] ? _raw_spin_unlock_irqrestore+0x4c/0x60 <4> [183.558326] ? poll_select_set_timeout+0x41/0x70 <4> [183.558329] ? ktime_get_ts64+0x128/0x150 <4> [183.558331] ? lockdep_hardirqs_on+0xe0/0x1b0 <4> [183.558334] ? recalibrate_cpu_khz+0x10/0x10 <4> [183.558335] ? ktime_get_ts64+0x98/0x150 <4> [183.558338] ? __se_sys_poll+0x8f/0x120 <4> [183.558340] __se_sys_poll+0x8f/0x120 <4> [183.558343] do_syscall_64+0x55/0x190 <4> [183.558345] entry_SYSCALL_64_after_hwframe+0x49/0xbe <4> [183.558346] RIP: 0033:0x7f513b42fbf9 <4> [183.558350] Code: Bad RIP value. <4> [183.558352] RSP: 002b:00007f5136395c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000007 <4> [183.558353] RAX: ffffffffffffffda RBX: 000055a7d8947018 RCX: 00007f513b42fbf9 <4> [183.558354] RDX: 0000000000000fa0 RSI: 0000000000000002 RDI: 000055a7d8947018 <4> [183.558355] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000080 <4> [183.558356] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000fa0 <4> [183.558357] R13: 00007f5136395cd0 R14: 0000000000000003 R15: 000055a7d8946ec0 <4> [183.558362] Mem-Info: <4> [183.558365] active_anon:102400 inactive_anon:101892 isolated_anon:1024 active_file:9 inactive_file:0 isolated_file:0 unevictable:1666433 dirty:0 writeback:0 unstable:0 slab_reclaimable:7664 slab_unreclaimable:97112 mapped:0 shmem:100 pagetables:5118 bounce:0 free:25494 free_pcp:0 free_cma:0
The CI Bug Log issue associated to this bug has been updated. ### New filters associated * fi-kbl-guc: igt@gem_tiled_swapping@non-threaded - Fail - thermald invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE) - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_195/fi-kbl-guc/igt@gem_tiled_swapping@non-threaded.html
I gave in and removed the test from CI: commit 324ab48e67065f0cf67525b3ab9c44fd3dcaef0a (upstream/master, origin/master, origin/HEAD) Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Feb 15 19:09:10 2019 +0000 intel-ci: Disable i915_suspend@shrink This test produces an awful, awful lot of redundant output as it tries to find just the right amount of memory pressure to cause an out-of-memory event in the middle of suspend. That is always quite a slow process, taking 90s on a normal machine and 500+s on skl-y. Furthermore, even when we do achieve the perfect setup, the test frequently locks up and fails to resume with no indication that it is a bug in the driver. The shrinker and oomkiller (plus i915) do not make for a pleasant time! Enough of Martin's whinging, I see no way of easily making this test quieter, quicker and more efficacious, relegate it to the masochist only stable. Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Martin Peres <martin.peres@free.fr> Cc: Petri Latvala <petri.latvala@intel.com> Reviewed-by: Martin Peres <martin.peres@free.fr>
(In reply to Chris Wilson from comment #7) > I gave in and removed the test from CI: > > commit 324ab48e67065f0cf67525b3ab9c44fd3dcaef0a (upstream/master, > origin/master, origin/HEAD) > Author: Chris Wilson <chris@chris-wilson.co.uk> > Date: Fri Feb 15 19:09:10 2019 +0000 > > intel-ci: Disable i915_suspend@shrink > > This test produces an awful, awful lot of redundant output as it tries > to find just the right amount of memory pressure to cause an > out-of-memory event in the middle of suspend. That is always quite a > slow process, taking 90s on a normal machine and 500+s on skl-y. > Furthermore, even when we do achieve the perfect setup, the test > frequently locks up and fails to resume with no indication that it is a > bug in the driver. The shrinker and oomkiller (plus i915) do not make for > a pleasant time! > > Enough of Martin's whinging, I see no way of easily making this test > quieter, quicker and more efficacious, relegate it to the masochist only > stable. > > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> > Cc: Martin Peres <martin.peres@free.fr> > Cc: Petri Latvala <petri.latvala@intel.com> > Reviewed-by: Martin Peres <martin.peres@free.fr> Thanks Chris, this definitely helps the runtime and DB space usage!
The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.