Summary: | [CI][BAT] igt@i_suspend@shrink - dmesg-warn - (java|i915_suspend) invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=(0|1000) | ||
---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> |
Component: | DRM/Intel | Assignee: | brian.welty |
Status: | CLOSED WONTFIX | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | highest | CC: | intel-gfx-bugs, sudeep.dutt |
Version: | XOrg git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | ALL | i915 features: |
Description
Martin Peres
2018-11-19 09:45:27 UTC
The oomkiller is being invoked on purpose as part of the test and there is no way to suppress the error messages. Of course, sometimes, oomkiller fails and the test fails in which case the error is significant...

(In reply to Chris Wilson from comment #1)
> The oomkiller is being invoked on purpose as part of the test and there is
> no way to suppress the error messages. Of course, sometimes, oomkiller fails
> and the test fails in which case the error is significant...

This is new though. 70 machines hitting this at the same time, and having enormous logs filling up the DB...

(In reply to Martin Peres from comment #2)
> (In reply to Chris Wilson from comment #1)
> > The oomkiller is being invoked on purpose as part of the test and there is
> > no way to suppress the error messages. Of course, sometimes, oomkiller fails
> > and the test fails in which case the error is significant...
>
> This is new though. 70 machines hitting this at the same time, and having
> enormous logs filling up the DB...

It was only new due to the test name change (drv_suspend -> i915_suspend). There appears to be no way from userspace to suppress the warn_alloc_show_mem()

(In reply to Chris Wilson from comment #3)
> (In reply to Martin Peres from comment #2)
> > (In reply to Chris Wilson from comment #1)
> > > The oomkiller is being invoked on purpose as part of the test and there is
> > > no way to suppress the error messages. Of course, sometimes, oomkiller fails
> > > and the test fails in which case the error is significant...
> >
> > This is new though. 70 machines hitting this at the same time, and having
> > enormous logs filling up the DB...
>
> It was only new due to the test name change (drv_suspend -> i915_suspend).
> There appears to be no way from userspace to suppress the
> warn_alloc_show_mem()

Thanks for the explanation. Could we add a kernel parameter to hide this? It could land in core-for-CI, and we could even try to upstream it.
Generating this warning in every run on every machine just fills the database, and adds noise to results which confuse people (dmesg-warn is not an acceptable status).

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_195/fi-kbl-guc/igt@gem_tiled_swapping@non-threaded.html

<6> [172.407794] [IGT] gem_tiled_swapping: executing
<6> [183.558045] Purging GPU memory, 0 pages freed, 82 pages still pinned.
<4> [183.558188] thermald invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
<4> [183.558209] CPU: 4 PID: 530 Comm: thermald Tainted: G U 5.0.0-rc2-gcd04bc47971a-drmtip_195+ #1
<4> [183.558210] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 3610 03/29/2018
<4> [183.558211] Call Trace:
<4> [183.558215] dump_stack+0x67/0x9b
<4> [183.558218] dump_header+0x52/0x58e
<4> [183.558221] ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [183.558224] ? _raw_spin_unlock_irqrestore+0x39/0x60
<4> [183.558226] oom_kill_process+0x310/0x3a0
<4> [183.558229] out_of_memory+0x101/0x3b0
<4> [183.558232] __alloc_pages_nodemask+0xd6c/0x1110
<4> [183.558235] ? lock_acquire+0xa6/0x1c0
<4> [183.558241] __read_swap_cache_async+0x131/0x1d0
<4> [183.558244] read_swap_cache_async+0x23/0x60
<4> [183.558247] swapin_readahead+0x14a/0x3f0
<4> [183.558251] ? pagecache_get_page+0x2b/0x210
<4> [183.558254] ? do_swap_page+0x2ea/0x950
<4> [183.558256] do_swap_page+0x2ea/0x950
<4> [183.558258] ? __switch_to_asm+0x40/0x70
<4> [183.558261] __handle_mm_fault+0x66a/0xfa0
<4> [183.558266] handle_mm_fault+0x196/0x3a0
<4> [183.558270] __do_page_fault+0x246/0x500
<4> [183.558273] page_fault+0x1e/0x30
<4> [183.558276] RIP: 0010:do_sys_poll+0x395/0x580
<4> [183.558277] Code: 41 0c 85 c0 0f 8e 2b 01 00 00 31 c0 eb 10 83 c0 01 48 83 c2 08 39 41 08 0f 8e 17 01 00 00 0f 01 cb 48 63 f0 41 0f b7 74 f0 06 <66> 89 72 06 31 f6 0f 01 ca 85 f6 74 d7 c7 85 40 fc ff ff f2 ff ff
<4> [183.558279] RSP: 0018:ffffaacdc0c23ae0 EFLAGS: 00050246
<4> [183.558280] RAX: 0000000000000000 RBX: ffffaacdc0c23b5c RCX: ffffaacdc0c23b40
<4> [183.558281] RDX: 000055a7d8947018 RSI: 0000000000000000 RDI: 00000000fffffff2
<4> [183.558282] RBP: ffffaacdc0c23ef0 R08: ffffaacdc0c23b4c R09: 0000000000000000
<4> [183.558283] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
<4> [183.558284] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
<4> [183.558295] ? lock_acquire+0xa6/0x1c0
<4> [183.558298] ? __is_insn_slot_addr+0x8d/0x120
<4> [183.558301] ? __lock_acquire+0x3c7/0x1b00
<4> [183.558305] ? poll_select_copy_remaining+0x1b0/0x1b0
<4> [183.558308] ? poll_select_copy_remaining+0x1b0/0x1b0
<4> [183.558310] ? lock_acquire+0xa6/0x1c0
<4> [183.558313] ? __lock_acquire+0x3c7/0x1b00
<4> [183.558315] ? __lock_acquire+0x3c7/0x1b00
<4> [183.558319] ? debug_object_active_state+0x137/0x160
<4> [183.558322] ? _raw_spin_unlock_irqrestore+0x4c/0x60
<4> [183.558326] ? poll_select_set_timeout+0x41/0x70
<4> [183.558329] ? ktime_get_ts64+0x128/0x150
<4> [183.558331] ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [183.558334] ? recalibrate_cpu_khz+0x10/0x10
<4> [183.558335] ? ktime_get_ts64+0x98/0x150
<4> [183.558338] ? __se_sys_poll+0x8f/0x120
<4> [183.558340] __se_sys_poll+0x8f/0x120
<4> [183.558343] do_syscall_64+0x55/0x190
<4> [183.558345] entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [183.558346] RIP: 0033:0x7f513b42fbf9
<4> [183.558350] Code: Bad RIP value.
<4> [183.558352] RSP: 002b:00007f5136395c90 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
<4> [183.558353] RAX: ffffffffffffffda RBX: 000055a7d8947018 RCX: 00007f513b42fbf9
<4> [183.558354] RDX: 0000000000000fa0 RSI: 0000000000000002 RDI: 000055a7d8947018
<4> [183.558355] RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000080
<4> [183.558356] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000fa0
<4> [183.558357] R13: 00007f5136395cd0 R14: 0000000000000003 R15: 000055a7d8946ec0
<4> [183.558362] Mem-Info:
<4> [183.558365] active_anon:102400 inactive_anon:101892 isolated_anon:1024 active_file:9 inactive_file:0 isolated_file:0 unevictable:1666433 dirty:0 writeback:0 unstable:0 slab_reclaimable:7664 slab_unreclaimable:97112 mapped:0 shmem:100 pagetables:5118 bounce:0 free:25494 free_pcp:0 free_cma:0

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* fi-kbl-guc: igt@gem_tiled_swapping@non-threaded - Fail - thermald invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE)
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_195/fi-kbl-guc/igt@gem_tiled_swapping@non-threaded.html

I gave in and removed the test from CI:

commit 324ab48e67065f0cf67525b3ab9c44fd3dcaef0a (upstream/master, origin/master, origin/HEAD)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Feb 15 19:09:10 2019 +0000

intel-ci: Disable i915_suspend@shrink

This test produces an awful, awful lot of redundant output as it tries to find just the right amount of memory pressure to cause an out-of-memory event in the middle of suspend. That is always quite a slow process, taking 90s on a normal machine and 500+s on skl-y. Furthermore, even when we do achieve the perfect setup, the test frequently locks up and fails to resume with no indication that it is a bug in the driver. The shrinker and oomkiller (plus i915) do not make for a pleasant time!
Enough of Martin's whinging, I see no way of easily making this test quieter, quicker and more efficacious, relegate it to the masochist only stable.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Martin Peres <martin.peres@free.fr>
Cc: Petri Latvala <petri.latvala@intel.com>
Reviewed-by: Martin Peres <martin.peres@free.fr>

(In reply to Chris Wilson from comment #7)
> I gave in and removed the test from CI:
>
> commit 324ab48e67065f0cf67525b3ab9c44fd3dcaef0a (upstream/master,
> origin/master, origin/HEAD)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Fri Feb 15 19:09:10 2019 +0000
>
> intel-ci: Disable i915_suspend@shrink
>
> This test produces an awful, awful lot of redundant output as it tries
> to find just the right amount of memory pressure to cause an
> out-of-memory event in the middle of suspend. That is always quite a
> slow process, taking 90s on a normal machine and 500+s on skl-y.
> Furthermore, even when we do achieve the perfect setup, the test
> frequently locks up and fails to resume with no indication that it is a
> bug in the driver. The shrinker and oomkiller (plus i915) do not make for
> a pleasant time!
>
> Enough of Martin's whinging, I see no way of easily making this test
> quieter, quicker and more efficacious, relegate it to the masochist only
> stable.
>
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Martin Peres <martin.peres@free.fr>
> Cc: Petri Latvala <petri.latvala@intel.com>
> Reviewed-by: Martin Peres <martin.peres@free.fr>

Thanks Chris, this definitely helps the runtime and DB space usage!

The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.
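
For readers triaging similar logs: the two oom-killer signatures in this report differ only in the process name and in whether a `nodemask=` field is printed. The sketch below is one way to match both forms in a single pass. It is an illustration only, not the CI Bug Log filter implementation; the pattern `OOM_RE` and the helper `parse_oom_line` are hypothetical names introduced here, and the regular expression is an assumption derived from the two lines quoted in this bug.

```python
import re

# Hypothetical pattern, loosened from the bug summary: it accepts any
# process name and treats the "nodemask=..." field as optional, because the
# gem_tiled_swapping log quoted in this bug omits it.
OOM_RE = re.compile(
    r"(?P<comm>\S+) invoked oom-killer: "
    r"gfp_mask=(?P<gfp>0x[0-9a-fA-F]+)\((?P<flags>[A-Z0-9_|]+)\), "
    r"(?:nodemask=\S+, )?"
    r"order=(?P<order>\d+), oom_score_adj=(?P<adj>-?\d+)"
)


def parse_oom_line(line):
    """Return the fields of an oom-killer header line, or None if no match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    return {
        "comm": match.group("comm"),
        "gfp_mask": int(match.group("gfp"), 16),
        "gfp_flags": match.group("flags"),
        "order": int(match.group("order")),
        "oom_score_adj": int(match.group("adj")),
    }


if __name__ == "__main__":
    # Header line copied from the dmesg excerpt in this bug.
    sample = (
        "<4> [183.558188] thermald invoked oom-killer: "
        "gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0"
    )
    print(parse_oom_line(sample))
```

Grouping on the parsed `comm` field rather than hard-coding `(java|i915_suspend)` would also have caught the `thermald` victim without a new filter, at the cost of matching oom-killer events that are not caused by the test.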