Bug 108784 - [CI][BAT] igt@i915_suspend@shrink - dmesg-warn - (java|i915_suspend) invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=(0|1000)
Summary: [CI][BAT] igt@i915_suspend@shrink - dmesg-warn - (java|i915_suspend) invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=(0|1000)
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other
OS: All
Importance: highest normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-11-19 09:45 UTC by Martin Peres
Modified: 2018-12-07 13:29 UTC
CC List: 1 user

See Also:
i915 platform: ALL
i915 features:


Description Martin Peres 2018-11-19 09:45:27 UTC
70 machines are producing the following error:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5156/shard-iclb6/igt@i915_suspend@shrink.html

<4> [2684.655248] i915_suspend invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=1000
<4> [2684.655299] CPU: 3 PID: 9251 Comm: i915_suspend Tainted: G     U            4.20.0-rc3-CI-CI_DRM_5156+ #1
<4> [2684.655301] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2402.AD3.1810170014 10/17/2018
<4> [2684.655303] Call Trace:
<4> [2684.655308]  dump_stack+0x67/0x9b
<4> [2684.655314]  dump_header+0x60/0x42e
<4> [2684.655318]  ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [2684.655322]  ? _raw_spin_unlock_irqrestore+0x39/0x60
<4> [2684.655327]  oom_kill_process+0x310/0x3a0
<4> [2684.655332]  out_of_memory+0x101/0x380
<4> [2684.655336]  __alloc_pages_nodemask+0xda6/0x1170
<4> [2684.655342]  ? find_get_entry+0x198/0x2f0
<4> [2684.655351]  shmem_alloc_and_acct_page+0x6f/0x1d0
<4> [2684.655356]  shmem_getpage_gfp.isra.8+0x16d/0xd90
<4> [2684.655359]  ? reacquire_held_locks+0xb5/0x1b0
<4> [2684.655367]  shmem_fault+0x93/0x1f0
<4> [2684.655371]  ? __handle_mm_fault+0xce9/0xfc0
<4> [2684.655376]  __do_fault+0x1b/0x80
<4> [2684.655379]  ? _raw_spin_unlock+0x29/0x40
<4> [2684.655381]  __handle_mm_fault+0x9ac/0xfc0
<4> [2684.655389]  handle_mm_fault+0x197/0x3f0
<4> [2684.655394]  __get_user_pages+0x204/0x7f0
<4> [2684.655404]  populate_vma_page_range+0x68/0x70
<4> [2684.655407]  __mm_populate+0xa4/0x150
<4> [2684.655412]  do_mlock+0x10c/0x230
<4> [2684.655416]  __x64_sys_mlock+0x12/0x20
<4> [2684.655419]  do_syscall_64+0x55/0x190
<4> [2684.655422]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [2684.655425] RIP: 0033:0x7fca3785cc47
<4> [2684.655430] Code: Bad RIP value.
<4> [2684.655432] RSP: 002b:00007ffe28e52308 EFLAGS: 00000287 ORIG_RAX: 0000000000000095
<4> [2684.655435] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fca3785cc47
<4> [2684.655437] RDX: 00000003ce440000 RSI: 00000003ce440000 RDI: 00007fc6500aa000
<4> [2684.655438] RBP: 00007ffe28e52350 R08: 00007fca381b3980 R09: 00007fca37d4c2b0
<4> [2684.655440] R10: 0000000000000000 R11: 0000000000000287 R12: 000055e2cfaf10c0
<4> [2684.655441] R13: 00007ffe28e524a0 R14: 0000000000000000 R15: 0000000000000000
<4> [2684.655449] Mem-Info:
<4> [2684.655454] active_anon:0 inactive_anon:41 isolated_anon:0 active_file:11 inactive_file:37 isolated_file:0 unevictable:3946670 dirty:2 writeback:0 unstable:0 slab_reclaimable:20558 slab_unreclaimable:19758 mapped:3946565 shmem:3946670 pagetables:15204 bounce:0 free:35801 free_pcp:30 free_cma:0
<4> [2684.655458] Node 0 active_anon:0kB inactive_anon:164kB active_file:44kB inactive_file:148kB unevictable:15786680kB isolated(anon):0kB isolated(file):0kB mapped:15786260kB dirty:8kB writeback:0kB shmem:15786680kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? yes
<4> [2684.655463] DMA free:15884kB min:64kB low:80kB high:96kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15992kB managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
<4> [2684.655464] lowmem_reserve[]: 0 704 15784 15784
<4> [2684.655471] DMA32 free:63212kB min:3012kB low:3764kB high:4516kB active_anon:4kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:658232kB writepending:0kB present:838716kB managed:725464kB mlocked:658232kB kernel_stack:0kB pagetables:1288kB bounce:0kB free_pcp:120kB local_pcp:0kB free_cma:0kB
<4> [2684.655472] lowmem_reserve[]: 0 0 15079 15079
<4> [2684.655478] Normal free:64108kB min:64500kB low:80624kB high:96748kB active_anon:0kB inactive_anon:0kB active_file:324kB inactive_file:16kB unevictable:15128292kB writepending:0kB present:15732736kB managed:15441548kB mlocked:15128076kB kernel_stack:4656kB pagetables:59528kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
<4> [2684.655479] lowmem_reserve[]: 0 0 0 0
<4> [2684.655482] DMA: 1*4kB (U) 1*8kB (U) 2*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
<4> [2684.655515] DMA32: 1*4kB (U) 2*8kB (UM) 3*16kB (UM) 2*32kB (U) 0*64kB 1*128kB (E) 0*256kB 3*512kB (UME) 2*1024kB (ME) 1*2048kB (U) 14*4096kB (ME) = 63236kB
<4> [2684.655526] Normal: 117*4kB (UME) 41*8kB (UME) 174*16kB (UME) 78*32kB (UME) 38*64kB (UME) 153*128kB (UM) 98*256kB (UME) 22*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 64444kB
<6> [2684.655538] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
<4> [2684.655540] 3946783 total pagecache pages
<4> [2684.655544] 7 pages in swap cache
<4> [2684.655546] Swap cache stats: add 54938, delete 54930, find 172/271
<4> [2684.655547] Free swap  = 1877244kB
<4> [2684.655549] Total swap = 2097148kB
<4> [2684.655551] 4146861 pages RAM
<4> [2684.655552] 0 pages HighMem/MovableOnly
<4> [2684.655554] 101137 pages reserved
<6> [2684.655555] Tasks state (memory values in pages):
<6> [2684.655557] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
<6> [2684.655672] [    296]     0   296    25766        0   212992      201             0 systemd-journal
<6> [2684.655795] [    317]     0   317    11081        1   114688      497         -1000 systemd-udevd
<6> [2684.655800] [    492]   101   492    17652        0   176128      166             0 systemd-resolve
<6> [2684.655803] [    495]   102   495    65758        0   167936     1012             0 rsyslogd
<6> [2684.655807] [    497]   103   497    12580        1   143360      241          -900 dbus-daemon
<6> [2684.655811] [    518]     0   518    17647        1   176128      185             0 systemd-logind
<6> [2684.655815] [    525]     0   525    44726        1   237568     1992             0 networkd-dispat
<6> [2684.655818] [    533]     0   533     1137        0    57344       48             0 acpid
<6> [2684.655822] [    550]     0   550    27618        0   114688       98             0 irqbalance
<6> [2684.655826] [    561]     0   561     2478        0    61440       33             0 rngd
<6> [2684.655830] [    601]     0   601   141206        0   446464      739             0 NetworkManager
<6> [2684.655833] [    677]     0   677    18073        1   180224      187         -1000 sshd
<6> [2684.655837] [    678]     0   678    72218        0   200704      219             0 polkitd
<6> [2684.655841] [    682]     0   682     6413        1    94208      305             0 dhclient
<6> [2684.655845] [    829]     0   829    26430        1   245760      246             0 sshd
<6> [2684.655849] [    834]  1000   834    19153        1   192512      271             0 systemd
<6> [2684.655853] [    837]  1000   837    28477        0   253952      597             0 (sd-pam)
<6> [2684.655857] [    875]  1000   875    27133        1   249856      373             0 sshd
<6> [2684.655861] [    911]  1000   911  2215825        0   966656    44277             0 java
<6> [2684.655865] [    949]   112   949    96616        1   372736      425             0 whoopsie
<6> [2684.655868] [    958]     0   958     6126        0    69632       37             0 agetty
<6> [2684.655872] [   1037]     0  1037    18483        1   184320      119             0 sudo
<6> [2684.655876] [   1042]     0  1042     2478        0    65536       33             0 rngd
<6> [2684.655879] [   1681]     0  1681    18483        1   188416      118             0 sudo
<6> [2684.655883] [   1686]     0  1686     2478        0    57344       32             0 rngd
<6> [2684.655886] [   2307]     0  2307    18483        1   184320      119             0 sudo
<6> [2684.655890] [   2312]     0  2312     2478        0    65536       32             0 rngd
<6> [2684.655893] [   2987]     0  2987    18483        1   184320      118             0 sudo
<6> [2684.655897] [   2992]     0  2992     2478        0    61440       33             0 rngd
<6> [2684.655901] [   3743]     0  3743    18483        1   184320      119             0 sudo
<6> [2684.655904] [   3748]     0  3748     2478        0    61440       33             0 rngd
<6> [2684.655909] [   5537]     0  5537    18483        1   184320      118             0 sudo
<6> [2684.655920] [   5542]     0  5542     2478        0    65536       33             0 rngd
<6> [2684.655930] [   9097]  1000  9097     5302        1    77824       73             0 bash
<6> [2684.655934] [   9173]     0  9173    18483        1   180224      120             0 sudo
<6> [2684.655937] [   9178]     0  9178     2478        0    61440       32             0 rngd
<6> [2684.655941] [   9197]  1000  9197     6097        0    81920       38             0 dmesg
<6> [2684.655945] [   9199]     0  9199    18483        1   180224      119             0 sudo
<6> [2684.655948] [   9203]     0  9203    50678        1   397312      435             0 igt_runner
<6> [2684.655952] [   9250]     0  9250  4095973  2941969 23998464      430          1000 i915_suspend
<6> [2684.655956] [   9251]     0  9251  4095973  3946578 32022528      430          1000 i915_suspend
<3> [2684.655958] Out of memory: Kill process 9251 (i915_suspend) score 1865 or sacrifice child
<3> [2684.657457] Killed process 9251 (i915_suspend) total-vm:16383892kB, anon-rss:0kB, file-rss:0kB, shmem-rss:15786312kB
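
For reference, the call trace shows the test reaching the oom-killer through mlock() of a huge shmem-backed mapping (do_mlock -> __mm_populate -> shmem_fault -> out_of_memory), with RDX/RSI (0x3ce440000, ~15 GiB) as the request size. A minimal sketch of that pattern follows; it is illustrative only, not the actual igt@i915_suspend@shrink source, and using memfd_create() to obtain the shmem mapping is an assumption:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* ~15 GiB, matching RDX/RSI (0x3ce440000) in the register dump
	 * above; assumes a 64-bit size_t. */
	size_t sz = 0x3ce440000ULL;

	/* Anonymous shmem file; shmem_fault() in the trace indicates the
	 * mapping was shmem-backed, and memfd_create() is one way to get
	 * such a file (glibc >= 2.27). */
	int fd = memfd_create("shrink-sketch", 0);
	if (fd < 0 || ftruncate(fd, sz))
		return 1;

	void *ptr = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED)
		return 1;

	/* mlock() populates every page up front; once the request exceeds
	 * available RAM the kernel invokes the oom-killer, producing the
	 * dmesg output quoted above. */
	if (mlock(ptr, sz))
		perror("mlock");

	return 0;
}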
Comment 1 Chris Wilson 2018-11-19 10:00:17 UTC
The oom-killer is being invoked on purpose as part of the test, and there is no way to suppress the error messages. Of course, sometimes the oom-killer fails and the test fails, in which case the error is significant...
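
For context: the task table above shows oom_score_adj=1000 for both i915_suspend processes, i.e. the test nominates itself as the oom-killer's preferred victim. A sketch of that mechanism (the helper name is made up; this is not necessarily how igt does it):

#include <stdio.h>

/* Write 1000 (the maximum adjustment, "always pick me first") to
 * /proc/self/oom_score_adj, matching the value seen in the task
 * table above. */
static int become_preferred_oom_victim(void)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "w");
	if (!f)
		return -1;
	fputs("1000", f);
	return fclose(f);
}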
Comment 2 Martin Peres 2018-11-19 11:06:00 UTC
(In reply to Chris Wilson from comment #1)
> The oom-killer is being invoked on purpose as part of the test, and there is
> no way to suppress the error messages. Of course, sometimes the oom-killer
> fails and the test fails, in which case the error is significant...

This is new, though: 70 machines hitting this at the same time, and enormous logs filling up the DB...
Comment 3 Chris Wilson 2018-12-04 20:43:56 UTC
(In reply to Martin Peres from comment #2)
> (In reply to Chris Wilson from comment #1)
> > The oom-killer is being invoked on purpose as part of the test, and there is
> > no way to suppress the error messages. Of course, sometimes the oom-killer
> > fails and the test fails, in which case the error is significant...
> 
> This is new, though: 70 machines hitting this at the same time, and
> enormous logs filling up the DB...

It was only new due to the test name change (drv_suspend -> i915_suspend). There appears to be no way from userspace to suppress the warn_alloc_show_mem() output.
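
For reference, the closest existing userspace knob is vm.oom_dump_tasks, but it only suppresses the per-task table in the OOM report, not the header or the Mem-Info dump. A sketch of flipping it, e.g. from a CI setup helper running as root (the function name is made up):

#include <stdio.h>

/* Setting vm.oom_dump_tasks to 0 silences only the per-task table in
 * the OOM report; the header and memory dump still appear. */
static int suppress_oom_task_dump(void)
{
	FILE *f = fopen("/proc/sys/vm/oom_dump_tasks", "w");
	if (!f)
		return -1;
	fputs("0", f);
	return fclose(f);
}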
Comment 4 Martin Peres 2018-12-07 13:29:55 UTC
(In reply to Chris Wilson from comment #3)
> (In reply to Martin Peres from comment #2)
> > (In reply to Chris Wilson from comment #1)
> > > The oom-killer is being invoked on purpose as part of the test, and there is
> > > no way to suppress the error messages. Of course, sometimes the oom-killer
> > > fails and the test fails, in which case the error is significant...
> > 
> > This is new, though: 70 machines hitting this at the same time, and
> > enormous logs filling up the DB...
> 
> It was only new due to the test name change (drv_suspend -> i915_suspend).
> There appears to be no way from userspace to suppress the
> warn_alloc_show_mem() output.

Thanks for the explanation.

Could we add a kernel parameter to hide this? It could land in core-for-CI, and we could even try to upstream it.

Generating this warning on every run on every machine just fills the database and adds noise to the results, which confuses people (dmesg-warn is not an acceptable status).
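
For illustration, a hypothetical sketch of what such a parameter could look like in core-for-CI. Nothing like this exists upstream; the name oom_quiet and the exact hook point are assumptions:

#include <linux/moduleparam.h>

/* Hypothetical boot parameter (oom_quiet=1) that would skip the
 * verbose OOM report in mm/oom_kill.c while still killing the victim,
 * so intentionally triggered OOMs stop flooding CI logs. */
static bool oom_quiet __read_mostly;
core_param(oom_quiet, oom_quiet, bool, 0644);

/* ...then, in the path that prints the report (dump_header() and the
 * memory/task dumps), bail out early when quiet mode is requested: */
	if (oom_quiet)
		return;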

