Bug 106609 - [CI] igt@drv_selftest@live_gtt - dmesg-fail - drv_selftest invoked oom-killer
Summary: [CI] igt@drv_selftest@live_gtt - dmesg-fail - drv_selftest invoked oom-killer
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Whiteboard: ReadyForDev
Reported: 2018-05-22 07:35 UTC by Martin Peres
Modified: 2018-05-22 21:16 UTC

i915 platform: BXT
i915 features: GEM/Other


Description Martin Peres 2018-05-22 07:35:08 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_4158/shard-apl2/igt@drv_selftest@live_gtt.html

[ 1780.015457] 1 and 0 pages still available in the bound and unbound GPU page lists.
[ 1780.015563] drv_selftest invoked oom-killer: gfp_mask=0x14042c0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP), nodemask=(null), order=0, oom_score_adj=1000
[ 1780.015591] CPU: 2 PID: 12201 Comm: drv_selftest Tainted: G     U            4.17.0-rc4-CI-CI_DRM_4158+ #1
[ 1780.015610] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0047.2018.0108.1419 01/08/2018
[ 1780.015627] Call Trace:
[ 1780.015643]  dump_stack+0x67/0x9b
[ 1780.015656]  dump_header+0x60/0x42e
[ 1780.015670]  ? _raw_spin_unlock_irqrestore+0x39/0x60
[ 1780.015685]  oom_kill_process+0x2be/0x6d0
[ 1780.015699]  out_of_memory+0x103/0x390
[ 1780.015712]  __alloc_pages_nodemask+0xe3f/0x1250
[ 1780.015735]  new_slab+0x237/0x550
[ 1780.015748]  ___slab_alloc.constprop.34+0x322/0x3e0
[ 1780.015856]  ? alloc_pt+0x22/0x60 [i915]
[ 1780.015871]  ? _set_pages_array+0x122/0x130
[ 1780.015885]  ? lock_acquire+0xa6/0x210
[ 1780.015969]  ? alloc_pt+0x22/0x60 [i915]
[ 1780.015981]  ? __slab_alloc.isra.27.constprop.33+0x3d/0x70
[ 1780.015994]  __slab_alloc.isra.27.constprop.33+0x3d/0x70
[ 1780.016077]  ? alloc_pt+0x22/0x60 [i915]
[ 1780.016088]  kmem_cache_alloc_trace+0x246/0x2e0
[ 1780.016170]  alloc_pt+0x22/0x60 [i915]
[ 1780.016251]  gen8_ppgtt_alloc_pdp+0x16f/0x490 [i915]
[ 1780.016336]  gen8_ppgtt_alloc_4lvl+0x5a/0x150 [i915]
[ 1780.016420]  igt_ppgtt_alloc+0xe7/0x1d0 [i915]
[ 1780.016515]  __i915_subtests+0x44/0xd0 [i915]
[ 1780.016606]  __run_selftests+0x10b/0x190 [i915]
[ 1780.016694]  i915_live_selftests+0x2c/0x60 [i915]
[ 1780.016773]  i915_pci_probe+0x3b/0x90 [i915]
[ 1780.016789]  pci_device_probe+0xa1/0x130
[ 1780.016803]  driver_probe_device+0x306/0x480
[ 1780.016816]  __driver_attach+0xb7/0xe0
[ 1780.016827]  ? driver_probe_device+0x480/0x480
[ 1780.016839]  ? driver_probe_device+0x480/0x480
[ 1780.016851]  bus_for_each_dev+0x74/0xc0
[ 1780.016864]  bus_add_driver+0x15f/0x250
[ 1780.016875]  ? 0xffffffffa0757000
[ 1780.016886]  driver_register+0x52/0xc0
[ 1780.016896]  ? 0xffffffffa0757000
[ 1780.016906]  do_one_initcall+0x58/0x370
[ 1780.016919]  ? kmem_cache_alloc_trace+0x209/0x2e0
[ 1780.016934]  do_init_module+0x56/0x1ea
[ 1780.016946]  load_module+0x2435/0x2b20
[ 1780.016969]  ? __se_sys_finit_module+0xd3/0xf0
[ 1780.016980]  __se_sys_finit_module+0xd3/0xf0
[ 1780.017000]  do_syscall_64+0x55/0x190
[ 1780.017011]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1780.017024] RIP: 0033:0x7f0e2f2fe839
[ 1780.017034] RSP: 002b:00007ffdf9cc9e48 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
[ 1780.017052] RAX: ffffffffffffffda RBX: 000055c5fa2fd110 RCX: 00007f0e2f2fe839
[ 1780.017067] RDX: 0000000000000000 RSI: 000055c5fa2fdf20 RDI: 0000000000000004
[ 1780.017082] RBP: 000055c5fa2fdf20 R08: 0000000000000004 R09: 0000000000000000
[ 1780.017097] R10: 00007ffdf9cc9fb0 R11: 0000000000000246 R12: 0000000000000000
[ 1780.017111] R13: 000055c5fa2f7200 R14: 0000000000000000 R15: 0000000000000037
[ 1780.026373] Mem-Info:
[ 1780.026389] active_anon:37 inactive_anon:0 isolated_anon:0
                active_file:107 inactive_file:0 isolated_file:0
                unevictable:0 dirty:4 writeback:0 unstable:0
                slab_reclaimable:6280 slab_unreclaimable:182609
                mapped:80 shmem:65 pagetables:1719 bounce:0
                free:26771 free_pcp:0 free_cma:0
[ 1780.026452] Node 0 active_anon:148kB inactive_anon:0kB active_file:428kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:320kB dirty:16kB writeback:0kB shmem:260kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 4096kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 1780.026504] DMA free:15896kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15896kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1780.026549] lowmem_reserve[]: 0 1768 7770 7770
[ 1780.026570] DMA32 free:39192kB min:15348kB low:19184kB high:23020kB active_anon:12kB inactive_anon:160kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1926488kB managed:1814684kB mlocked:0kB kernel_stack:48kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1780.026616] lowmem_reserve[]: 0 0 6001 6001
[ 1780.026637] Normal free:51996kB min:52096kB low:65120kB high:78144kB active_anon:148kB inactive_anon:360kB active_file:284kB inactive_file:864kB unevictable:0kB writepending:0kB present:6291456kB managed:6145908kB mlocked:0kB kernel_stack:2960kB pagetables:6876kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1780.026685] lowmem_reserve[]: 0 0 0 0
[ 1780.026700] DMA: 2*4kB (U) 2*8kB (U) 2*16kB (U) 3*32kB (U) 2*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB
[ 1780.026757] DMA32: 9*4kB (M) 11*8kB (M) 14*16kB (UM) 7*32kB (M) 13*64kB (UM) 10*128kB (UM) 12*256kB (UM) 10*512kB (M) 8*1024kB (M) 8*2048kB (ME) 1*4096kB (M) = 39548kB
[ 1780.026817] Normal: 1337*4kB (UME) 854*8kB (UME) 429*16kB (UME) 299*32kB (UME) 146*64kB (UM) 34*128kB (UM) 11*256kB (UM) 3*512kB (UM) 2*1024kB (U) 2*2048kB (M) 0*4096kB = 52804kB
[ 1780.026897] 237 total pagecache pages
[ 1780.026908] 8 pages in swap cache
[ 1780.026917] Swap cache stats: add 75013, delete 75006, find 87/131
[ 1780.026930] Free swap  = 1796860kB
[ 1780.026938] Total swap = 2097148kB
[ 1780.026947] 2058482 pages RAM
[ 1780.026955] 0 pages HighMem/MovableOnly
[ 1780.026964] 64360 pages reserved
[ 1780.029494] Out of memory: Kill process 12201 (drv_selftest) score 1000 or sacrifice child
[ 1780.029519] Killed process 12201 (drv_selftest) total-vm:280136kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[ 1780.162035] 1 and 0 pages still available in the bound and unbound GPU page lists.
[ 1780.162135] python3 invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
[ 1780.162160] CPU: 2 PID: 7705 Comm: python3 Tainted: G     U            4.17.0-rc4-CI-CI_DRM_4158+ #1
[ 1780.162179] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0047.2018.0108.1419 01/08/2018
[ 1780.162196] Call Trace:
[ 1780.162211]  dump_stack+0x67/0x9b
[ 1780.162224]  dump_header+0x60/0x42e
[ 1780.162238]  ? _raw_spin_unlock_irqrestore+0x39/0x60
[ 1780.162252]  oom_kill_process+0x2be/0x6d0
[ 1780.162266]  out_of_memory+0x103/0x390
[ 1780.162279]  __alloc_pages_nodemask+0xe3f/0x1250
[ 1780.162295]  ? lock_acquire+0xa6/0x210
[ 1780.162313]  __read_swap_cache_async+0x148/0x260
[ 1780.162328]  swapin_readahead+0x312/0x410
[ 1780.162342]  ? pagecache_get_page+0x2b/0x210
[ 1780.162356]  ? do_swap_page+0x2e2/0x910
[ 1780.162367]  do_swap_page+0x2e2/0x910
[ 1780.162381]  __handle_mm_fault+0x65e/0xe30
[ 1780.162398]  handle_mm_fault+0x196/0x3a0
[ 1780.162413]  __do_page_fault+0x295/0x590
[ 1780.162428]  page_fault+0x1e/0x30
[ 1780.162440] RIP: 0010:copy_user_generic_unrolled+0x89/0xc0
[ 1780.162452] RSP: 0000:ffffc90001137e60 EFLAGS: 00050202
[ 1780.162466] RAX: 00007ffffffff000 RBX: 0000000000000010 RCX: 0000000000000002
[ 1780.162481] RDX: 0000000000000000 RSI: ffffc90001137e98 RDI: 00007f47279ec4c0
[ 1780.162496] RBP: 00007f47279ec4c0 R08: 0000000000000000 R09: 0000000000000000
[ 1780.162511] R10: ffffc90001137de0 R11: ffff88014ae6ab48 R12: ffffc90001137e98
[ 1780.162525] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[ 1780.162552]  _copy_to_user+0x56/0x70
[ 1780.162565]  poll_select_copy_remaining+0xda/0x140
[ 1780.162580]  kern_select+0xc2/0x100
[ 1780.162593]  __x64_sys_select+0x1b/0x20
[ 1780.162604]  do_syscall_64+0x55/0x190
[ 1780.162615]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1780.162628] RIP: 0033:0x7f473569303f
[ 1780.162637] RSP: 002b:00007f47279ec460 EFLAGS: 00000293 ORIG_RAX: 0000000000000017
[ 1780.162655] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f473569303f
[ 1780.162670] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 1780.162684] RBP: 0000000000000000 R08: 00007f47279ec4c0 R09: 0000000000000000
[ 1780.162699] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[ 1780.162714] R13: 0000000000000000 R14: 00007f47279ec4c0 R15: 00007f472eadade0
[ 1780.162751] Mem-Info:
[ 1780.162764] active_anon:37 inactive_anon:0 isolated_anon:0
                active_file:107 inactive_file:0 isolated_file:0
                unevictable:0 dirty:4 writeback:0 unstable:0
                slab_reclaimable:6280 slab_unreclaimable:183385
                mapped:80 shmem:65 pagetables:1719 bounce:0
                free:18225 free_pcp:60 free_cma:0
[ 1780.162828] Node 0 active_anon:148kB inactive_anon:0kB active_file:428kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:320kB dirty:16kB writeback:0kB shmem:260kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 4096kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[ 1780.162879] DMA free:15896kB min:132kB low:164kB high:196kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15984kB managed:15896kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 1780.162923] lowmem_reserve[]: 0 1768 7770 7770
[ 1780.162944] DMA32 free:31564kB min:15348kB low:19184kB high:23020kB active_anon:12kB inactive_anon:160kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:1926488kB managed:1814684kB mlocked:0kB kernel_stack:48kB pagetables:0kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
[ 1780.162991] lowmem_reserve[]: 0 0 6001 6001
[ 1780.163011] Normal free:25440kB min:52096kB low:65120kB high:78144kB active_anon:148kB inactive_anon:360kB active_file:284kB inactive_file:864kB unevictable:0kB writepending:0kB present:6291456kB managed:6145908kB mlocked:0kB kernel_stack:2960kB pagetables:6876kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB
[ 1780.163059] lowmem_reserve[]: 0 0 0 0
[ 1780.163075] DMA: 2*4kB (U) 2*8kB (U) 2*16kB (U) 3*32kB (U) 2*64kB (U) 0*128kB 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15896kB
[ 1780.163131] DMA32: 10*4kB (UM) 11*8kB (M) 13*16kB (UM) 7*32kB (M) 12*64kB (M) 9*128kB (M) 11*256kB (M) 10*512kB (M) 9*1024kB (UM) 6*2048kB (M) 0*4096kB = 31920kB
[ 1780.163189] Normal: 888*4kB (ME) 618*8kB (ME) 313*16kB (UME) 198*32kB (UME) 73*64kB (M) 12*128kB (UM) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 26048kB
[ 1780.163260] 237 total pagecache pages
[ 1780.163271] 8 pages in swap cache
[ 1780.163281] Swap cache stats: add 75013, delete 75006, find 87/132
[ 1780.163294] Free swap  = 1798396kB
[ 1780.163302] Total swap = 2097148kB
[ 1780.163311] 2058482 pages RAM
[ 1780.163320] 0 pages HighMem/MovableOnly
[ 1780.163329] 64360 pages reserved
[ 1780.165847] Out of memory: Kill process 874 (java) score 13 or sacrifice child
[ 1780.165887] Killed process 7370 (bash) total-vm:14104kB, anon-rss:0kB, file-rss:4kB, shmem-rss:0kB
[ 1792.569817] i915: probe of 0000:00:02.0 failed with error -4
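A rough way to read the first Mem-Info dump above (the drv_selftest report) is to compare each zone's free memory against its min watermark plus the lowmem reserve held back from a Normal-zone request. The sketch below is a simplification, not the kernel's exact logic: it assumes 4 kB pages, assumes index 2 of each `lowmem_reserve[]` line corresponds to the Normal zone, and ignores the adjustments the allocator makes for allocation flags and reclaim. The values are copied from the log; the function and variable names are illustrative only.

```python
PAGE_KB = 4  # assumption: 4 kB pages, as is usual on x86-64

# (zone, free kB, min kB, lowmem_reserve in pages for a Normal-zone request)
# Values copied from the first Mem-Info dump in the log above.
zones = [
    ("DMA", 15896, 132, 7770),
    ("DMA32", 39192, 15348, 6001),
    ("Normal", 51996, 52096, 0),
]


def passes_min_watermark(free_kb, min_kb, reserve_pages):
    """Simplified check: free memory must exceed min plus the reserve."""
    return free_kb > min_kb + reserve_pages * PAGE_KB


results = {}
for name, free_kb, min_kb, reserve_pages in zones:
    needed_kb = min_kb + reserve_pages * PAGE_KB
    results[name] = passes_min_watermark(free_kb, min_kb, reserve_pages)
    print(f"{name}: free={free_kb}kB needed>{needed_kb}kB ok={results[name]}")
```

Under these assumptions no zone clears its threshold, which is consistent with the allocation falling through to the OOM killer.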
Comment 1 Chris Wilson 2018-05-22 11:10:43 UTC
Suppression applied,

commit 1abb70f5955d1a9021f96359a2c6502ca569b68d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue May 22 09:36:43 2018 +0100

    drm/i915/gtt: Allow pagedirectory allocations to fail
    
    As we handle the allocation failure of the page directory and tables by
    propagating the failure back to userspace, allow it to fail if direct
    reclaim is unable to satisfy the request (i.e. disable the oomkiller).
    The premise being that if we are unable to allocate a single page for
    the pagetable, we will not be able to handle the multitude of pages
    required for the gfx operation and we should back off to allow the
    system to recover.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106609
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Matthew Auld <matthew.william.auld@gmail.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Matthew Auld <matthew.william.auld@gmail.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180522083643.29601-1-chris@chris-wilson.co.uk
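The pattern the commit message describes (let the page-table allocation fail, then propagate the error to the caller instead of triggering the OOM killer) can be illustrated with a toy model. This is not i915 or kernel code: `PagePool` and `alloc_page_tables` are hypothetical names, and the fixed page budget merely stands in for a system where reclaim cannot satisfy the request.

```python
import errno


class PagePool:
    """Toy page allocator with a fixed budget (hypothetical, not i915 code)."""

    def __init__(self, pages):
        self.free = pages

    def alloc(self):
        # Fail softly when nothing is left, rather than killing a process.
        if self.free == 0:
            return None
        self.free -= 1
        return bytearray(4096)

    def release(self, page):
        self.free += 1


def alloc_page_tables(pool, count):
    """Allocate `count` tables; on failure, unwind and return -ENOMEM."""
    tables = []
    for _ in range(count):
        page = pool.alloc()
        if page is None:
            # Give back what was taken so the system can recover.
            for taken in tables:
                pool.release(taken)
            return -errno.ENOMEM, []
        tables.append(page)
    return 0, tables


pool = PagePool(pages=3)
err, tables = alloc_page_tables(pool, 5)   # more than the pool can supply
print(err == -errno.ENOMEM, pool.free)
```

The caller sees an ordinary error code and the pool is left as it was found, which is the "back off to allow the system to recover" behaviour the commit aims for.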
Comment 2 Martin Peres 2018-05-22 21:05:00 UTC
(In reply to Chris Wilson from comment #1)
> Suppression applied,
> 
> commit 1abb70f5955d1a9021f96359a2c6502ca569b68d
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Tue May 22 09:36:43 2018 +0100
> 
>     drm/i915/gtt: Allow pagedirectory allocations to fail
>     
>     As we handle the allocation failure of the page directory and tables by
>     propagating the failure back to userspace, allow it to fail if direct
>     reclaim is unable to satisfy the request (i.e. disable the oomkiller).
>     The premise being that if we are unable to allocate a single page for
>     the pagetable, we will not be able to handle the multitude of pages
>     required for the gfx operation and we should back off to allow the
>     system to recover.
>     
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=106609
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Matthew Auld <matthew.william.auld@gmail.com>
>     Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>     Reviewed-by: Matthew Auld <matthew.william.auld@gmail.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20180522083643.29601-1-chris@chris-wilson.co.uk

I'll trust you on that!
Comment 3 Chris Wilson 2018-05-22 21:16:58 UTC
Note that the OOM still exists; it is just that the oomkiller shouldn't be triggered directly by our code ;) Bug #105347 is a good dumping ground for OOMs hit by java, avahi-daemon, etc., which unfortunately try to run concurrently with the test that is eating all the memory.