Bug 108784 - [CI][BAT] igt@i_suspend@shrink - dmesg-warn - (java|i915_suspend) invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=(0|1000)
Status: RESOLVED WONTFIX
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other
OS: All
Priority: highest
Severity: normal
Assignee: brian.welty
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-19 09:45 UTC by Martin Peres
Modified: 2019-02-15 20:49 UTC
2 users

See Also:
i915 platform: ALL
i915 features:


Attachments

Description Martin Peres 2018-11-19 09:45:27 UTC
70 machines are producing the following errors:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5156/shard-iclb6/igt@i915_suspend@shrink.html

<4> [2684.655248] i915_suspend invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=1000
<4> [2684.655299] CPU: 3 PID: 9251 Comm: i915_suspend Tainted: G     U            4.20.0-rc3-CI-CI_DRM_5156+ #1
<4> [2684.655301] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018
<4> [2684.655303] Call Trace:
<4> [2684.655308]  dump_stack+0x67/0x9b
<4> [2684.655314]  dump_header+0x60/0x42e
<4> [2684.655318]  ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [2684.655322]  ? _raw_spin_unlock_irqrestore+0x39/0x60
<4> [2684.655327]  oom_kill_process+0x310/0x3a0
<4> [2684.655332]  out_of_memory+0x101/0x380
<4> [2684.655336]  __alloc_pages_nodemask+0xda6/0x1170
<4> [2684.655342]  ? find_get_entry+0x198/0x2f0
<4> [2684.655351]  shmem_alloc_and_acct_page+0x6f/0x1d0
<4> [2684.655356]  shmem_getpage_gfp.isra.8+0x16d/0xd90
<4> [2684.655359]  ? reacquire_held_locks+0xb5/0x1b0
<4> [2684.655367]  shmem_fault+0x93/0x1f0
<4> [2684.655371]  ? __handle_mm_fault+0xce9/0xfc0
<4> [2684.655376]  __do_fault+0x1b/0x80
<4> [2684.655379]  ? _raw_spin_unlock+0x29/0x40
<4> [2684.655381]  __handle_mm_fault+0x9ac/0xfc0
<4> [2684.655389]  handle_mm_fault+0x197/0x3f0
<4> [2684.655394]  __get_user_pages+0x204/0x7f0
<4> [2684.655404]  populate_vma_page_range+0x68/0x70
<4> [2684.655407]  __mm_populate+0xa4/0x150
<4> [2684.655412]  do_mlock+0x10c/0x230
<4> [2684.655416]  __x64_sys_mlock+0x12/0x20
<4> [2684.655419]  do_syscall_64+0x55/0x190
<4> [2684.655422]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [2684.655425] RIP: 0033:0x7fca3785cc47
<4> [2684.655430] Code: Bad RIP value.
<4> [2684.655432] RSP: 002b:00007ffe28e52308 EFLAGS: 00000287 ORIG_RAX: 0000000000000095
<4> [2684.655435] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fca3785cc47
<4> [2684.655437] RDX: 00000003ce440000 RSI: 00000003ce440000 RDI: 00007fc6500aa000
<4> [2684.655438] RBP: 00007ffe28e52350 R08: 00007fca381b3980 R09: 00007fca37d4c2b0
<4> [2684.655440] R10: 0000000000000000 R11: 0000000000000287 R12: 000055e2cfaf10c0
<4> [2684.655441] R13: 00007ffe28e524a0 R14: 0000000000000000 R15: 0000000000000000
<4> [2684.655449] Mem-Info:
<4> [2684.655454] active_anon:0 inactive_anon:41 isolated_anon:0
 active_file:11 inactive_file:37 isolated_file:0
 unevictable:3946670 dirty:2 writeback:0 unstable:0
 slab_reclaimable:20558 slab_unreclaimable:19758
 mapped:3946565 shmem:3946670 pagetables:15204 bounce:0
 free:35801 free_pcp:30 free_cma:0
<4> [2684.655458] Node 0 active_anon:0kB inactive_anon:164kB active_file:44kB inactive_file:148kB unevictable:15786680kB isolated(anon):0kB isolated(file):0kB mapped:15786260kB dirty:8kB writeback:0kB shmem:15786680kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
<4> [2684.655463] DMA free:15884kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
<4> [2684.655464] lowmem_reserve[]: 0 704 15784 15784
<4> [2684.655471] DMA32 free:63212kB min:3012kB low:3764kB high:4516kB active_anon:4kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:658232kB writepending:0kB present:838716kB managed:725464kB mlocked:658232kB kernel_stack:0kB pagetables:1288kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB
<4> [2684.655472] lowmem_reserve[]: 0 0 15079 15079
<4> [2684.655478] Normal free:64108kB min:64500kB low:80624kB high:96748kB active_anon:0kB inactive_anon:0kB active_file:324kB inactive_file:16kB unevictable:15128292kB writepending:0kB present:15732736kB managed:15441548kB mlocked:15128076kB kernel_stack:4656kB pagetables:59528kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
<4> [2684.655479] lowmem_reserve[]: 0 0 0 0
<4> [2684.655482] DMA: 1*4kB (U) 1*8kB (U) 2*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
<4> [2684.655515] DMA32: 1*4kB (U) 2*8kB (UM) 3*16kB (UM) 2*32kB (U) 0*64kB 1*128kB (E) 0*256kB 3*512kB (UME) 2*1024kB (ME) 1*2048kB (U) 14*4096kB (ME) = 63236kB
<4> [2684.655526] Normal: 117*4kB (UME) 41*8kB (UME) 174*16kB (UME) 78*32kB (UME) 38*64kB (UME) 153*128kB (UM) 98*256kB (UME) 22*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 64444kB
<6> [2684.655538] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
<4> [2684.655540] 3946783 total pagecache pages
<4> [2684.655544] 7 pages in swap cache
<4> [2684.655546] Swap cache stats: add 54938, delete 54930, find 172/271
<4> [2684.655547] Free swap  = 1877244kB
<4> [2684.655549] Total swap = 2097148kB
<4> [2684.655551] 4146861 pages RAM
<4> [2684.655552] 0 pages HighMem/MovableOnly
<4> [2684.655554] 101137 pages reserved
<6> [2684.655555] Tasks state (memory values in pages):
<6> [2684.655557] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
<6> [2684.655672] [    296]     0   296    25766        0   212992      201             0 systemd-journal
<6> [2684.655795] [    317]     0   317    11081        1   114688      497         -1000 systemd-udevd
<6> [2684.655800] [    492]   101   492    17652        0   176128      166             0 systemd-resolve
<6> [2684.655803] [    495]   102   495    65758        0   167936     1012             0 rsyslogd
<6> [2684.655807] [    497]   103   497    12580        1   143360      241          -900 dbus-daemon
<6> [2684.655811] [    518]     0   518    17647        1   176128      185             0 systemd-logind
<6> [2684.655815] [    525]     0   525    44726        1   237568     1992             0 networkd-dispat
<6> [2684.655818] [    533]     0   533     1137        0    57344       48             0 acpid
<6> [2684.655822] [    550]     0   550    27618        0   114688       98             0 irqbalance
<6> [2684.655826] [    561]     0   561     2478        0    61440       33             0 rngd
<6> [2684.655830] [    601]     0   601   141206        0   446464      739             0 NetworkManager
<6> [2684.655833] [    677]     0   677    18073        1   180224      187         -1000 sshd
<6> [2684.655837] [    678]     0   678    72218        0   200704      219             0 polkitd
<6> [2684.655841] [    682]     0   682     6413        1    94208      305             0 dhclient
<6> [2684.655845] [    829]     0   829    26430        1   245760      246             0 sshd
<6> [2684.655849] [    834]  1000   834    19153        1   192512      271             0 systemd
<6> [2684.655853] [    837]  1000   837    28477        0   253952      597             0 (sd-pam)
<6> [2684.655857] [    875]  1000   875    27133        1   249856      373             0 sshd
<6> [2684.655861] [    911]  1000   911  2215825        0   966656    44277             0 java
<6> [2684.655865] [    949]   112   949    96616        1   372736      425             0 whoopsie
<6> [2684.655868] [    958]     0   958     6126        0    69632       37             0 agetty
<6> [2684.655872] [   1037]     0  1037    18483        1   184320      119             0 sudo
<6> [2684.655876] [   1042]     0  1042     2478        0    65536       33             0 rngd
<6> [2684.655879] [   1681]     0  1681    18483        1   188416      118             0 sudo
<6> [2684.655883] [   1686]     0  1686     2478        0    57344       32             0 rngd
<6> [2684.655886] [   2307]     0  2307    18483        1   184320      119             0 sudo
<6> [2684.655890] [   2312]     0  2312     2478        0    65536       32             0 rngd
<6> [2684.655893] [   2987]     0  2987    18483        1   184320      118             0 sudo
<6> [2684.655897] [   2992]     0  2992     2478        0    61440       33             0 rngd
<6> [2684.655901] [   3743]     0  3743    18483        1   184320      119             0 sudo
<6> [2684.655904] [   3748]     0  3748     2478        0    61440       33             0 rngd
<6> [2684.655909] [   5537]     0  5537    18483        1   184320      118             0 sudo
<6> [2684.655920] [   5542]     0  5542     2478        0    65536       33             0 rngd
<6> [2684.655930] [   9097]  1000  9097     5302        1    77824       73             0 bash
<6> [2684.655934] [   9173]     0  9173    18483        1   180224      120             0 sudo
<6> [2684.655937] [   9178]     0  9178     2478        0    61440       32             0 rngd
<6> [2684.655941] [   9197]  1000  9197     6097        0    81920       38             0 dmesg
<6> [2684.655945] [   9199]     0  9199    18483        1   180224      119             0 sudo
<6> [2684.655948] [   9203]     0  9203    50678        1   397312      435             0 igt_runner
<6> [2684.655952] [   9250]     0  9250  4095973  2941969 23998464      430          1000 i915_suspend
<6> [2684.655956] [   9251]     0  9251  4095973  3946578 32022528      430          1000 i915_suspend
<3> [2684.655958] Out of memory: Kill process 9251 (i915_suspend) score 1865 or sacrifice child
<3> [2684.657457] Killed process 9251 (i915_suspend) total-vm:16383892kB, anon-rss:0kB, file-rss:0kB, shmem-rss:15786312kB
Comment 1 Chris Wilson 2018-11-19 10:00:17 UTC
The oomkiller is being invoked on purpose as part of the test and there is no way to suppress the error messages. Of course, sometimes, oomkiller fails and the test fails in which case the error is significant...
Comment 2 Martin Peres 2018-11-19 11:06:00 UTC
(In reply to Chris Wilson from comment #1)
> The oomkiller is being invoked on purpose as part of the test and there is
> no way to suppress the error messages. Of course, sometimes, oomkiller fails
> and the test fails in which case the error is significant...

This is new though. 70 machines hitting this at the same time, and having enormous logs filling up the DB...
Comment 3 Chris Wilson 2018-12-04 20:43:56 UTC
(In reply to Martin Peres from comment #2)
> (In reply to Chris Wilson from comment #1)
> > The oomkiller is being invoked on purpose as part of the test and there is
> > no way to suppress the error messages. Of course, sometimes, oomkiller fails
> > and the test fails in which case the error is significant...
> 
> This is new though. 70 machines hitting this at the same time, and having
> enormous logs filling up the DB...

It was only new due to the test name change (drv_suspend -> i915_suspend). There appears to be no way from userspace to suppress the warn_alloc_show_mem()
Comment 4 Martin Peres 2018-12-07 13:29:55 UTC
(In reply to Chris Wilson from comment #3)
> (In reply to Martin Peres from comment #2)
> > (In reply to Chris Wilson from comment #1)
> > > The oomkiller is being invoked on purpose as part of the test and there is
> > > no way to suppress the error messages. Of course, sometimes, oomkiller fails
> > > and the test fails in which case the error is significant...
> > 
> > This is new though. 70 machines hitting this at the same time, and having
> > enormous logs filling up the DB...
> 
> It was only new due to the test name change (drv_suspend -> i915_suspend).
> There appears to be no way from userspace to suppress the
> warn_alloc_show_mem()

Thanks for the explanation.

Could we add a kernel parameter to hide this? It could land in core-for-CI, and we could even try to upstream it.

Generating this warning in every run on every machine just fills the database and adds noise to the results, which confuses people (dmesg-warn is not an acceptable status).
Comment 5 Lakshmi 2019-01-31 12:27:50 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_195/fi-kbl-guc/igt@gem_tiled_swapping@non-threaded.html

<6> [172.407794] [IGT] gem_tiled_swapping: executing
<6> [183.558045] Purging GPU memory, 0 pages freed, 82 pages still pinned.
<4> [183.558188] thermald invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
<4> [183.558209] CPU: 4 PID: 530 Comm: thermald Tainted: G     U            5.0.0-rc2-gcd04bc47971a-drmtip_195+ #1
<4> [183.558210] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4> [183.558211] Call Trace:
<4> [183.558215]  dump_stack+0x67/0x9b
<4> [183.558218]  dump_header+0x52/0x58e
<4> [183.558221]  ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [183.558224]  ? _raw_spin_unlock_irqrestore+0x39/0x60
<4> [183.558226]  oom_kill_process+0x310/0x3a0
<4> [183.558229]  out_of_memory+0x101/0x3b0
<4> [183.558232]  __alloc_pages_nodemask+0xd6c/0x1110
<4> [183.558235]  ? lock_acquire+0xa6/0x1c0
<4> [183.558241]  __read_swap_cache_async+0x131/0x1d0
<4> [183.558244]  read_swap_cache_async+0x23/0x60
<4> [183.558247]  swapin_readahead+0x14a/0x3f0
<4> [183.558251]  ? pagecache_get_page+0x2b/0x210
<4> [183.558254]  ? do_swap_page+0x2ea/0x950
<4> [183.558256]  do_swap_page+0x2ea/0x950
<4> [183.558258]  ? __switch_to_asm+0x40/0x70
<4> [183.558261]  __handle_mm_fault+0x66a/0xfa0
<4> [183.558266]  handle_mm_fault+0x196/0x3a0
<4> [183.558270]  __do_page_fault+0x246/0x500
<4> [183.558273]  page_fault+0x1e/0x30
<4> [183.558276] RIP: 0010:do_sys_poll+0x395/0x580
<4> [183.558277] Code: 41 0c 85 c0 0f 8e 2b 01 00 00 31 c0 eb 10 83 c0 01 48 83 c2 08 39 41 08 0f 8e 17 01 00 00 0f 01 cb 48 63 f0 41 0f b7 74 f0 06 <66> 89 72 06 31 f6 0f 01 ca 85 f6 74 d7 c7 85 40 fc ff ff f2 ff ff
<4> [183.558279] RSP: 0018:ffffaacdc0c23ae0 EFLAGS: 00050246
<4> [183.558280] RAX: 0000000000000000 RBX: ffffaacdc0c23b5c RCX: ffffaacdc0c23b40
<4> [183.558281] RDX: 000055a7d8947018 RSI: 0000000000000000 RDI: 00000000fffffff2
<4> [183.558282] RBP: ffffaacdc0c23ef0 R08: ffffaacdc0c23b4c R09: 0000000000000000
<4> [183.558283] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4> [183.558284] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
<4> [183.558295]  ? lock_acquire+0xa6/0x1c0
<4> [183.558298]  ? __is_insn_slot_addr+0x8d/0x120
<4> [183.558301]  ? __lock_acquire+0x3c7/0x1b00
<4> [183.558305]  ? poll_select_copy_remaining+0x1b0/0x1b0
<4> [183.558308]  ? poll_select_copy_remaining+0x1b0/0x1b0
<4> [183.558310]  ? lock_acquire+0xa6/0x1c0
<4> [183.558313]  ? __lock_acquire+0x3c7/0x1b00
<4> [183.558315]  ? __lock_acquire+0x3c7/0x1b00
<4> [183.558319]  ? debug_object_active_state+0x137/0x160
<4> [183.558322]  ? _raw_spin_unlock_irqrestore+0x4c/0x60
<4> [183.558326]  ? poll_select_set_timeout+0x41/0x70
<4> [183.558329]  ? ktime_get_ts64+0x128/0x150
<4> [183.558331]  ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [183.558334]  ? recalibrate_cpu_khz+0x10/0x10
<4> [183.558335]  ? ktime_get_ts64+0x98/0x150
<4> [183.558338]  ? __se_sys_poll+0x8f/0x120
<4> [183.558340]  __se_sys_poll+0x8f/0x120
<4> [183.558343]  do_syscall_64+0x55/0x190
<4> [183.558345]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [183.558346] RIP: 0033:0x7f513b42fbf9
<4> [183.558350] Code: Bad RIP value.
<4> [183.558352] RSP: 002b:00007f5136395c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
<4> [183.558353] RAX: ffffffffffffffda RBX: 000055a7d8947018 RCX: 00007f513b42fbf9
<4> [183.558354] RDX: 0000000000000fa0 RSI: 0000000000000002 RDI: 000055a7d8947018
<4> [183.558355] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000080
<4> [183.558356] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000fa0
<4> [183.558357] R13: 00007f5136395cd0 R14: 0000000000000003 R15: 000055a7d8946ec0
<4> [183.558362] Mem-Info:
<4> [183.558365] active_anon:102400 inactive_anon:101892 isolated_anon:1024
 active_file:9 inactive_file:0 isolated_file:0
 unevictable:1666433 dirty:0 writeback:0 unstable:0
 slab_reclaimable:7664 slab_unreclaimable:97112
 mapped:0 shmem:100 pagetables:5118 bounce:0
 free:25494 free_pcp:0 free_cma:0
Comment 6 CI Bug Log 2019-01-31 12:33:20 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* fi-kbl-guc: igt@gem_tiled_swapping@non-threaded - Fail - thermald invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE)
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_195/fi-kbl-guc/igt@gem_tiled_swapping@non-threaded.html
Comment 7 Chris Wilson 2019-02-15 20:49:53 UTC
I gave in and removed the test from CI:

commit 324ab48e67065f0cf67525b3ab9c44fd3dcaef0a (upstream/master, origin/master, origin/HEAD)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 15 19:09:10 2019 +0000

    intel-ci: Disable i915_suspend@shrink
    
    This test produces an awful, awful lot of redundant output as it tries
    to find just the right amount of memory pressure to cause an
    out-of-memory event in the middle of suspend. That is always quite a
    slow process, taking 90s on a normal machine and 500+s on skl-y.
    Furthermore, even when we do achieve the perfect setup, the test
    frequently locks up and fails to resume with no indication that it is a
    bug in the driver. The shrinker and oomkiller (plus i915) do not make for
    a pleasant time!
    
    Enough of Martin's whinging, I see no way of easily making this test
    quieter, quicker and more efficacious, relegate it to the masochist only
    stable.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Martin Peres <martin.peres@free.fr>
    Cc: Petri Latvala <petri.latvala@intel.com>
    Reviewed-by: Martin Peres <martin.peres@free.fr>

